Recover alexastrum contributions from GitHub PR #68 (lost during mirror sync)
6 claims + 1 source originally merged Mar 9 via GitHub squash merge. Forgejo→GitHub mirror overwrote GitHub main, erasing these files. Recovered from unreachable commit 9bd6c77c before GitHub GC. Added sourcer: alexastrum attribution to claim frontmatter.
This commit is contained in:
parent
6d8ae9878f
commit
dba00a7960
7 changed files with 348 additions and 0 deletions
|
|
@ -0,0 +1,42 @@
|
||||||
|
---
|
||||||
|
type: claim
|
||||||
|
domain: ai-alignment
|
||||||
|
secondary_domains: [collective-intelligence]
|
||||||
|
description: "Mnemom's 0-1000 trust scale with Ed25519 signatures and STARK zero-knowledge proofs provides the first cryptographically verifiable agent reputation system, enabling CI gating on trust scores and predictive detection of feedback system degradation."
|
||||||
|
confidence: speculative
|
||||||
|
source: "Alex — based on Compass research artifact analyzing Mnemom agent trust system (2026-03-08)"
|
||||||
|
sourcer: alexastrum
|
||||||
|
created: 2026-03-08
|
||||||
|
---
|
||||||
|
|
||||||
|
# Cryptographic agent trust ratings enable meta-monitoring of AI feedback systems because persistent auditable reputation scores detect degrading review quality before it causes knowledge base corruption
|
||||||
|
|
||||||
|
A feedback system that validates knowledge claims needs a meta-feedback system that validates the validators. Without persistent reputation tracking, a reviewer agent that gradually accepts lower-quality claims — due to model drift, prompt degradation, or adversarial manipulation — degrades the knowledge base silently.
|
||||||
|
|
||||||
|
**Mnemom** provides the first production-ready implementation of cryptographic agent trust. The system assigns trust ratings on a 0-1000 scale with AAA-through-CCC grades. Team ratings weight five components: team coherence history (35%), aggregate member quality (25%), operational track record (20%), structural stability (10%), and assessment density (10%). Scores use Ed25519 signatures and STARK zero-knowledge proofs for tamper resistance, with a GitHub Action (`mnemom/reputation-check@v1`) for CI gating on trust scores.
|
||||||
|
|
||||||
|
The meta-monitoring capabilities this enables:
|
||||||
|
|
||||||
|
1. **Trend detection**: Weekly trust score snapshots reveal whether a reviewer agent's quality is improving, stable, or degrading. A declining trend triggers investigation before knowledge base quality degrades noticeably.
|
||||||
|
|
||||||
|
2. **Comparative calibration**: When multiple reviewer agents evaluate the same claims, trust score divergence signals that one reviewer has drifted from the collective standard.
|
||||||
|
|
||||||
|
3. **Predictive guardrails**: Historical trust data enables proactive intervention. An agent whose trust score drops below a threshold can be automatically suspended from review duties pending investigation.
|
||||||
|
|
||||||
|
4. **CI integration**: The GitHub Action enables gating PR merges on the reviewing agent's trust score — claims reviewed only by low-trust agents cannot merge, requiring escalation to higher-trust reviewers or human approval.
|
||||||
|
|
||||||
|
5. **Zero-knowledge attestation**: STARK proofs enable agents to prove their trust rating exceeds a threshold without revealing the exact score or the underlying data, preserving competitive dynamics while enabling trust-gated access.
|
||||||
|
|
||||||
|
The cryptographic component is essential, not optional. Without tamper-proof scores, an adversarial agent could manipulate its own reputation. Ed25519 signatures ensure scores are issued by the trust authority, and STARK proofs ensure verification without score disclosure.
|
||||||
|
|
||||||
|
For a knowledge base specifically, meta-monitoring addresses a failure mode that other oversight mechanisms miss: the slow degradation of review quality. Schema validation catches malformed claims. Adversarial probing catches specific errors. But only persistent reputation tracking catches the systemic pattern of a reviewer approving increasingly marginal claims over weeks or months.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Relevant Notes:
|
||||||
|
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — meta-monitoring detects when oversight quality is degrading, enabling intervention before it fails completely
|
||||||
|
- [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — trust rating degradation may be the observable signal of emergent reviewer misalignment
|
||||||
|
- [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]] — cryptographic trust scores provide an external check that is harder to game than behavioral observation alone
|
||||||
|
|
||||||
|
Topics:
|
||||||
|
- [[domains/ai-alignment/_map]]
|
||||||
|
|
@ -0,0 +1,46 @@
|
||||||
|
---
|
||||||
|
type: claim
|
||||||
|
domain: ai-alignment
|
||||||
|
description: "Layering pre-commit hooks, CI validation, YARA signature scanning, Cedar policy evaluation, LLM semantic review, and human approval creates a validation stack where each layer catches different failure modes and the deny-overrides principle ensures no single-layer bypass compromises the system."
|
||||||
|
confidence: experimental
|
||||||
|
source: "Alex — based on Compass research artifact analyzing Sondera's three-subsystem architecture and the seven honest feedback loop principles (2026-03-08)"
|
||||||
|
sourcer: alexastrum
|
||||||
|
created: 2026-03-08
|
||||||
|
---
|
||||||
|
|
||||||
|
# Defense in depth for AI agent oversight requires layering independent validation mechanisms because deny-overrides semantics ensure any single layer rejection blocks the action regardless of other layers
|
||||||
|
|
||||||
|
A single validation mechanism — no matter how sophisticated — has blind spots. Sondera's reference monitor demonstrates the defense-in-depth principle by combining three independent guardrail subsystems: a YARA-X signature engine for deterministic pattern matching (prompt injection, data exfiltration, secrets), an LLM-based policy model for probabilistic content classification, and an information flow control layer for sensitivity labeling. All signals feed into Cedar policies where a single matching `forbid` overrides any `permit`.
|
||||||
|
|
||||||
|
The deny-overrides principle is architecturally critical. In a system where multiple independent validators each return approve/deny decisions, two composition semantics are possible: any-approve (optimistic — action proceeds if any validator approves) or any-deny (pessimistic — action blocks if any validator denies). For safety-critical systems, any-deny is the correct choice because it means an attacker must bypass *every* layer simultaneously rather than finding one permissive layer.
|
||||||
|
|
||||||
|
Applied to a multi-agent knowledge base, the defense-in-depth stack includes:
|
||||||
|
|
||||||
|
1. **Pre-commit schema validation** (local, deterministic) — catches malformed files before they enter version control
|
||||||
|
2. **CI validation via Forgejo Actions** (server-side, deterministic) — catches `--no-verify` bypasses and ensures validation runs even when agents skip local hooks
|
||||||
|
3. **YARA signature scanning** (deterministic) — pattern-matches for known misinformation patterns, exfiltration attempts, and injected content
|
||||||
|
4. **Cedar policy evaluation** (deterministic) — enforces structural constraints: who can modify what, required approvals, step-count limits
|
||||||
|
5. **LLM-based semantic review** (probabilistic) — evaluates content quality, checks evidence strength, assesses whether claims meet intellectual standards
|
||||||
|
6. **Human approval** (final gate) — catches everything the automated layers miss
|
||||||
|
|
||||||
|
Each layer operates on different information:
|
||||||
|
- Layers 1-2 see file structure
|
||||||
|
- Layer 3 sees content patterns
|
||||||
|
- Layer 4 sees agent identity and action context
|
||||||
|
- Layer 5 sees semantic meaning
|
||||||
|
- Layer 6 sees everything through human judgment
|
||||||
|
|
||||||
|
The independence of these layers is what makes the system robust. A prompt injection attack might fool layer 5 (LLM semantic review) but cannot fool layer 3 (YARA signatures) or layer 4 (Cedar policies). A novel attack pattern might evade layer 3 (YARA) but be caught by layer 5 (LLM review). Only an attack that simultaneously bypasses all six layers succeeds — and each additional independent layer exponentially reduces the probability of total bypass.
|
||||||
|
|
||||||
|
This maps directly to the alignment insight that [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]. No single oversight mechanism is reliable enough on its own. But layered oversight where each mechanism is independently operated and uses deny-overrides composition can achieve reliability that no individual layer provides.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Relevant Notes:
|
||||||
|
- [[deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial-grade AI agent control]] — deterministic layers (1-4) provide the unforgeable foundation; probabilistic layers (5-6) provide semantic depth
|
||||||
|
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — defense in depth compensates for individual layer degradation by ensuring multiple independent checks
|
||||||
|
- [[safe AI development requires building alignment mechanisms before scaling capability]] — the validation stack should be in place before agents are trusted with autonomous knowledge base contributions
|
||||||
|
- [[knowledge validation requires four independent layers because syntactic schema cross-reference and semantic checks each catch failure modes the others miss]] — the four-layer validation model applied specifically to knowledge files
|
||||||
|
|
||||||
|
Topics:
|
||||||
|
- [[domains/ai-alignment/_map]]
|
||||||
|
|
@ -0,0 +1,33 @@
|
||||||
|
---
|
||||||
|
type: claim
|
||||||
|
domain: ai-alignment
|
||||||
|
description: "Sondera's Cedar/YARA reference monitor demonstrates that intercepting agent actions at the execution layer — not the prompt layer — provides guardrails that prompt injection cannot bypass, establishing a fundamental architectural distinction for AI safety infrastructure."
|
||||||
|
confidence: experimental
|
||||||
|
source: "Alex — based on Compass research artifact analyzing Sondera (sondera-ai/sondera-coding-agent-hooks), Claude Code hooks, and the broader agent control ecosystem (2026-03-08)"
|
||||||
|
sourcer: alexastrum
|
||||||
|
created: 2026-03-08
|
||||||
|
---
|
||||||
|
|
||||||
|
# Deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial-grade AI agent control
|
||||||
|
|
||||||
|
Two fundamentally different paradigms exist for controlling AI agent behavior, and understanding this distinction is essential for building trustworthy multi-agent systems.
|
||||||
|
|
||||||
|
**Advisory systems** inject rules into the LLM's context window but cannot enforce compliance. Cursor's `.cursor/rules/*.mdc` files, Windsurf's `.windsurf/rules/*.md` files, Aider's `CONVENTIONS.md`, and the emerging AGENTS.md cross-tool standard all operate at this level. They guide behavior through prompt engineering — useful for coding style preferences but insufficient for security-critical validation. The fundamental limitation: advisory rules can be ignored or circumvented by prompt injection, model drift, or context window overflow.
|
||||||
|
|
||||||
|
**Deterministic systems** intercept execution programmatically and can block actions regardless of what the LLM intended. Sondera's reference monitor (released at Unprompted 2026) demonstrates the strongest form: a Rust-based harness using YARA-X signatures for pattern matching and Amazon's Cedar policy language for access control, intercepting every shell command, file operation, and web request made by Claude Code, Cursor, GitHub Copilot, and Gemini CLI. A single matching Cedar `forbid` overrides any `permit` — the deny-overrides semantics ensure that no prompt injection can authorize a blocked action.
|
||||||
|
|
||||||
|
The architectural point is structural, not about any particular tool. When the enforcement mechanism operates below the LLM — intercepting tool calls, file writes, and shell commands at the execution boundary — the LLM cannot reason its way past the constraint. This is the same principle that makes OS-level permissions more reliable than application-level access checks: the enforcement point is outside the entity being constrained.
|
||||||
|
|
||||||
|
Additional deterministic systems confirm the pattern: CrewAI's `@before_tool_call` / `@after_tool_call` decorators return `False` to block execution; LangChain 1.0's middleware provides `before_model`, `wrap_model_call`, and `after_model` hooks; AutoGen's `MiddlewareAgent` can short-circuit with direct replies; MCP's approval policies flag destructive operations.
|
||||||
|
|
||||||
|
The practical recommendation for any multi-agent knowledge system is to **layer both paradigms**: use advisory rules (AGENTS.md, CLAUDE.md) for convention sharing, while enforcing compliance through deterministic hooks, Cedar policies, and CI gates that cannot be bypassed by the agents they constrain.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Relevant Notes:
|
||||||
|
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]] — formal verification is another instance of deterministic oversight that does not degrade with capability gaps
|
||||||
|
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — advisory oversight degrades; deterministic enforcement does not
|
||||||
|
- [[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]] — deterministic policy engines are a partial counter: they constrain actions, not intelligence, and operate outside the system being constrained
|
||||||
|
|
||||||
|
Topics:
|
||||||
|
- [[domains/ai-alignment/_map]]
|
||||||
|
|
@ -0,0 +1,34 @@
|
||||||
|
---
|
||||||
|
type: claim
|
||||||
|
domain: ai-alignment
|
||||||
|
description: "A complete validation stack for markdown/YAML knowledge files combines syntactic validation (yamllint, markdownlint), schema validation (JSON Schema for frontmatter), cross-reference validation (wiki-link integrity), and semantic validation (SHACL for graph-level consistency), with each layer catching categorically different errors."
|
||||||
|
confidence: experimental
|
||||||
|
source: "Alex — based on Compass research artifact analyzing pre-commit, check-jsonschema, remark-lint-frontmatter-schema, pySHACL, and cross-reference tooling (2026-03-08)"
|
||||||
|
sourcer: alexastrum
|
||||||
|
created: 2026-03-08
|
||||||
|
---
|
||||||
|
|
||||||
|
# Knowledge validation requires four independent layers because syntactic schema cross-reference and semantic checks each catch failure modes the others miss
|
||||||
|
|
||||||
|
For a knowledge base built from markdown files with YAML frontmatter, validation operates at four levels of increasing semantic depth. Each level catches errors that are invisible to the others.
|
||||||
|
|
||||||
|
**Layer 1: Syntactic validation** catches malformed files. Yamllint enforces YAML style rules. `check-yaml` catches syntax errors. Markdownlint-cli2 enforces markdown formatting (53+ configurable rules). `trailing-whitespace` and `end-of-file-fixer` handle hygiene. These run on every commit locally via pre-commit hooks and in CI as a safety net against `--no-verify` bypasses. What this catches: broken YAML that would silently corrupt frontmatter parsing, inconsistent formatting that degrades readability, encoding issues.
|
||||||
|
|
||||||
|
**Layer 2: Schema validation** catches structurally valid but semantically incomplete files. `check-jsonschema` validates YAML frontmatter against JSON Schema definitions — enforcing required fields (`source`, `confidence`, `date`, `domain`), constraining confidence to valid ranges, restricting domains to controlled vocabularies, and validating date formats. `remark-lint-frontmatter-schema` handles the markdown-specific case of frontmatter embedded in `.md` files. What this catches: claims missing required metadata, confidence values outside valid ranges, domains that don't match the controlled vocabulary, dates in wrong formats.
|
||||||
|
|
||||||
|
**Layer 3: Cross-reference validation** catches files that are internally valid but externally inconsistent. This requires custom scripting: parse all knowledge files to build a claim ID index, verify that `[[wiki links]]` point to existing files, check that `supersedes`, `related_to`, and `contradicts` references are bidirectional where required, and detect orphaned claims with no incoming links. No off-the-shelf tool handles this for flat markdown files. What this catches: broken wiki links, one-directional relationships that should be bidirectional, orphaned claims disconnected from the knowledge graph.
|
||||||
|
|
||||||
|
**Layer 4: Semantic validation** catches graph-level inconsistencies invisible to file-level checks. If claims are converted to RDF triples, SHACL (W3C Shapes Constraint Language) validates the knowledge graph against shape constraints including property paths, cardinality, and transitive relationship chains. pySHACL supports RDFS/OWL reasoning before validation. What this catches: contradictions across claims (claim A says X, claim B says not-X, both marked as "likely"), violation of relationship integrity constraints (a claim supersedes a claim that was created after it), structural impossibilities in the knowledge graph.
|
||||||
|
|
||||||
|
The four layers are complementary, not redundant. A file can pass syntactic and schema validation perfectly while containing a broken wiki link (layer 3 catches it). A file can pass all three local layers while contradicting another claim in the knowledge base (layer 4 catches it). Defense in depth means each layer operates independently — a failure in one layer does not compromise the others.
|
||||||
|
|
||||||
|
The practical tradeoff: layers 1-2 are nearly free (standard pre-commit hooks). Layer 3 requires custom tooling but operates on flat files. Layer 4 requires an RDF conversion pipeline, adding significant complexity. The recommendation is to implement layers 1-3 immediately and layer 4 only when the knowledge base reaches a scale where graph-level inconsistencies become a practical problem.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Relevant Notes:
|
||||||
|
- [[as AI-automated software development becomes certain the bottleneck shifts from building capacity to knowing what to build making structured knowledge graphs the critical input to autonomous systems]] — the validation stack ensures the knowledge graph that autonomous systems depend on is structurally sound
|
||||||
|
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]] — schema and cross-reference validation are lightweight formal verification applied to knowledge files rather than mathematical proofs
|
||||||
|
|
||||||
|
Topics:
|
||||||
|
- [[domains/ai-alignment/_map]]
|
||||||
|
|
@ -0,0 +1,35 @@
|
||||||
|
---
|
||||||
|
type: claim
|
||||||
|
domain: ai-alignment
|
||||||
|
secondary_domains: [collective-intelligence]
|
||||||
|
description: "SWE-AF deploys 400-500+ agents across planning, coding, reviewing, QA, and verification roles scoring 95/100 versus 73 for single-agent Claude Code, demonstrating that multi-agent coordination with continual learning has moved from research to production."
|
||||||
|
confidence: experimental
|
||||||
|
source: "Alex — based on Compass research artifact analyzing SWE-AF, Cisco multi-agent PR reviewer, and BugBot (2026-03-08)"
|
||||||
|
sourcer: alexastrum
|
||||||
|
created: 2026-03-08
|
||||||
|
---
|
||||||
|
|
||||||
|
# Multi-agent git workflows have reached production maturity as systems deploying 400+ specialized agent instances outperform single agents by 30 percent on engineering benchmarks
|
||||||
|
|
||||||
|
The pattern of Agent A proposing via PR and Agent B reviewing has moved from research concept to production system. Three implementations demonstrate different aspects of maturity.
|
||||||
|
|
||||||
|
**SWE-AF (Agent Field)** deploys 400-500+ agent instances across planning, coding, reviewing, QA, and verification roles, scoring 95/100 on benchmarks versus 73 for single-agent Claude Code. Each agent operates in an isolated git worktree, with a merger agent integrating branches and a verifier agent checking acceptance criteria against the PRD. Critically, SWE-AF implements **continual learning**: conventions and failure patterns discovered early are injected into downstream agent instances. This is not just parallelization — the system gets smarter as it works.
|
||||||
|
|
||||||
|
**Cisco's multi-agent PR reviewer** demonstrates the specific reviewer architecture: static analysis and code review agents run in parallel, a cross-referencing pipeline (initializer → generator → reflector) iterates on findings, and a comment filterer consolidates before posting. Built on LangGraph, it includes evaluation tooling that replays PR history with "LLM-as-a-judge" scoring.
|
||||||
|
|
||||||
|
**BugBot** implements the most rigorous adversarial review pattern: a self-referential execution loop where each iteration gets fresh context, picks new attack angles, and requires file:line evidence for every finding. Seven ODC trigger categories must each be tested, and consensus voting between independent agents auto-upgrades confidence when two agents flag the same issue.
|
||||||
|
|
||||||
|
The 95 vs 73 performance gap is significant because it demonstrates that coordination overhead is more than compensated by specialization benefits. This is consistent with the general finding that [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — the gains come from structuring how agents interact, not from making individual agents more capable.
|
||||||
|
|
||||||
|
The continual learning component is particularly important for knowledge base applications. In a knowledge validation pipeline, conventions and failure patterns discovered during early reviews (e.g., "claims about mechanism design require quantitative evidence") can be injected into downstream reviewer instances, creating an improving review process without human intervention.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Relevant Notes:
|
||||||
|
- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — SWE-AF confirms this at production scale: coordination structure, not model capability, drives the performance gap
|
||||||
|
- [[AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches because the orchestrator contributes coordination not direction]] — SWE-AF's merger and verifier agents are orchestration roles that contribute coordination
|
||||||
|
- [[tools and artifacts transfer between AI agents and evolve in the process because Agent O improved Agent Cs solver by combining it with its own structural knowledge creating a hybrid better than either original]] — SWE-AF's continual learning is this pattern at scale: conventions transfer and improve across instances
|
||||||
|
- [[centaur team performance depends on role complementarity not mere human-AI combination]] — role specialization (planner, coder, reviewer, QA, verifier) is why multi-agent outperforms single-agent
|
||||||
|
|
||||||
|
Topics:
|
||||||
|
- [[domains/ai-alignment/_map]]
|
||||||
|
|
@ -0,0 +1,43 @@
|
||||||
|
---
|
||||||
|
type: claim
|
||||||
|
domain: ai-alignment
|
||||||
|
secondary_domains: [collective-intelligence]
|
||||||
|
description: "Using separate Forgejo bot accounts for writer-agent and reviewer-agent, enforced by branch protection rules and Cedar policies that forbid self-merge, creates structural independence that advisory rules in AGENTS.md or CLAUDE.md cannot guarantee because the separation operates at the infrastructure layer."
|
||||||
|
confidence: experimental
|
||||||
|
source: "Alex — based on Compass research artifact analyzing Forgejo API patterns, branch protection, and the architectural separation principle (2026-03-08)"
|
||||||
|
sourcer: alexastrum
|
||||||
|
created: 2026-03-08
|
||||||
|
---
|
||||||
|
|
||||||
|
# Structurally separating proposer and reviewer agents across independent accounts with branch protection enforcement implements architectural separation that prompt-level rules cannot achieve
|
||||||
|
|
||||||
|
The honest feedback loop principle of architectural separation requires that the entity evaluating claims is structurally independent from the entity producing them. In a multi-agent knowledge base, this means the reviewer cannot be the same agent (or the same account, or the same process) as the proposer — and this constraint must be enforced by infrastructure, not by instructions.
|
||||||
|
|
||||||
|
**The advisory approach fails.** Telling an agent in its system prompt "do not approve your own PRs" is a convention that depends on the LLM following instructions. Prompt injection, model drift, context overflow, or simply a sufficiently capable agent reasoning about its objectives could circumvent this. AGENTS.md and CLAUDE.md files are advisory — they guide behavior but cannot enforce it.
|
||||||
|
|
||||||
|
**The infrastructure approach succeeds.** Forgejo (and GitHub) branch protection rules can require approval from a specific set of accounts before merge. If the proposer agent operates under account `writer-bot` and the reviewer operates under account `reviewer-bot`, and branch protection requires approval from `reviewer-bot`, then no prompt injection targeting `writer-bot` can bypass the review requirement. The enforcement point is in the git server, outside the LLM entirely.
|
||||||
|
|
||||||
|
Three mechanisms reinforce this separation:
|
||||||
|
|
||||||
|
1. **Separate bot accounts** — each agent authenticates with its own token, limiting what it can do via API permissions. The writer-bot token has push access but not merge access. The reviewer-bot token has review access.
|
||||||
|
|
||||||
|
2. **Branch protection rules** — the main knowledge branch requires N approvals from a defined set of reviewers. Direct pushes are blocked. Force pushes are blocked. This is enforced by the git server regardless of what any agent attempts.
|
||||||
|
|
||||||
|
3. **Cedar policies** — Sondera-style `forbid` rules can prevent the writer-bot from calling merge endpoints or from approving its own PRs, providing a second enforcement layer even if branch protection is misconfigured.
|
||||||
|
|
||||||
|
4. **Anti-recursion property** — Forgejo's automatic workflow token has a built-in anti-recursion rule: changes made with this token don't trigger new workflows. This prevents infinite loops in multi-agent pipelines but also means a single-token setup cannot implement true multi-agent review. Separate tokens for separate agents are required.
|
||||||
|
|
||||||
|
This pattern directly implements the principle that [[AI alignment is a coordination problem not a technical problem]]. The technical capability of the reviewer agent matters, but the structural independence of the review process matters more. A brilliant reviewer that shares an account with the proposer provides weaker guarantees than a mediocre reviewer on an independent account with infrastructure-enforced separation.
|
||||||
|
|
||||||
|
The analogy to financial auditing is precise: external auditors must be structurally independent from the companies they audit, not merely instructed to be objective. The instruction "be objective" is advisory. The SEC requirement for independent audit firms is architectural.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Relevant Notes:
|
||||||
|
- [[deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial-grade AI agent control]] — branch protection is a deterministic enforcement mechanism at the infrastructure layer
|
||||||
|
- [[AI alignment is a coordination problem not a technical problem]] — architectural separation is coordination infrastructure, not agent capability
|
||||||
|
- [[principal-agent problems arise whenever one party acts on behalf of another with divergent interests and unobservable effort because information asymmetry makes perfect contracts impossible]] — structural separation addresses the principal-agent problem between knowledge base and its agent contributors
|
||||||
|
- [[defense in depth for AI agent oversight requires layering independent validation mechanisms because deny-overrides semantics ensure any single layer rejection blocks the action regardless of other layers]] — architectural separation is one layer in the defense-in-depth stack
|
||||||
|
|
||||||
|
Topics:
|
||||||
|
- [[domains/ai-alignment/_map]]
|
||||||
|
|
@ -0,0 +1,115 @@
|
||||||
|
---
|
||||||
|
type: source
|
||||||
|
title: "Building honest multiagent knowledge bases on Forgejo"
|
||||||
|
author: "Compass (AI research artifact)"
|
||||||
|
url: null
|
||||||
|
date: 2026-03-08
|
||||||
|
domain: ai-alignment
|
||||||
|
format: report
|
||||||
|
status: processed
|
||||||
|
processed_by: "quaoar (contrib for Alex)"
|
||||||
|
processed_date: 2026-03-08
|
||||||
|
claims_extracted:
|
||||||
|
- "deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial-grade AI agent control"
|
||||||
|
- "multi-agent git workflows have reached production maturity as systems deploying 400+ specialized agent instances outperform single agents by 30 percent on engineering benchmarks"
|
||||||
|
- "cryptographic agent trust ratings enable meta-monitoring of AI feedback systems because persistent auditable reputation scores detect degrading review quality before it causes knowledge base corruption"
|
||||||
|
- "knowledge validation requires four independent layers because syntactic schema cross-reference and semantic checks each catch failure modes the others miss"
|
||||||
|
- "defense in depth for AI agent oversight requires layering independent validation mechanisms because deny-overrides semantics ensure any single layer rejection blocks the action regardless of other layers"
|
||||||
|
- "structurally separating proposer and reviewer agents across independent accounts with branch protection enforcement implements architectural separation that prompt-level rules cannot achieve"
|
||||||
|
tags: [multiagent, forgejo, knowledge-validation, deterministic-policy, cedar, hooks, trust-systems, feedback-loops]
|
||||||
|
contributor: "Alex"
|
||||||
|
---
|
||||||
|
|
||||||
|
# Building honest multiagent knowledge bases on Forgejo
|
||||||
|
|
||||||
|
A practical multiagent knowledge base with honest feedback loops is not only feasible today — the tooling ecosystem has matured dramatically since mid-2025. The combination of **Forgejo's webhook/Actions infrastructure**, **Cedar/YARA-based deterministic hook systems** (like Sondera's), **Claude Code's 15-event hook lifecycle**, and emerging **agent trust systems** (like Mnemom) provides every architectural primitive needed to implement all seven honest feedback loop principles. The critical insight: prompt-level rules (Cursor, Windsurf, Aider) are advisory and bypassable, but deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection — making them essential for adversarial-grade knowledge validation.
|
||||||
|
|
||||||
|
## Sondera's reference monitor sets the architectural standard
|
||||||
|
|
||||||
|
The repository at `sondera-ai/sondera-coding-agent-hooks` is a **Rust-based reference monitor** for AI coding agents, released at Unprompted 2026. Written in **84.8% Rust** with YARA signatures and Cedar policies, it intercepts every shell command, file operation, and web request made by Claude Code, Cursor, GitHub Copilot, and Gemini CLI through dedicated adapter binaries.
|
||||||
|
|
||||||
|
The architecture follows a client-server model over Unix sockets: each agent's hook adapter normalizes tool-specific events into a common event taxonomy (`ShellCommand`, `FileRead`, `FileWrite`, `WebFetch`, `ToolCall`), then forwards them via **tarpc RPC** to a harness service that combines three guardrail subsystems. The **YARA-X signature engine** pattern-matches for prompt injection, data exfiltration, and secrets (fully deterministic). An **LLM-based policy model** classifies content against secure code generation categories (probabilistic, requires Ollama). An **information flow control** layer assigns Microsoft Purview-style sensitivity labels for Bell-LaPadula enforcement. All signals feed into Amazon's **Cedar policy language**, where a single matching `forbid` overrides any `permit`.
|
||||||
|
|
||||||
|
The repository ships six Cedar policy files covering destructive operations (rm -rf, force-push, DROP DATABASE), file-type-aware guards with OWASP/CWE violation detection, supply chain attack patterns (typosquatting, dependency confusion), and information flow control with taint propagation. Entity state persists in an embedded **Fjall key-value store**, and an MCP server crate enables interactive Cedar policy authoring by another agent — a key primitive for self-improving guardrails.
|
||||||
|
|
||||||
|
For a Forgejo knowledge base system, Sondera's architecture directly maps to **architectural separation** (deterministic policy below the LLM), **defense in depth** (three guardrail subsystems), and **automatic error detection** (YARA signatures catch exfiltration without human involvement).
|
||||||
|
|
||||||
|
## Claude Code hooks provide the richest agent lifecycle interception
|
||||||
|
|
||||||
|
Claude Code's hook system, released June 2025 and expanded to **~15 lifecycle events** by early 2026, offers the most comprehensive agent interception model available. Unlike prompt-level rules that are advisory, hooks provide **deterministic, guaranteed execution** every time their conditions are met.
|
||||||
|
|
||||||
|
The event taxonomy spans three categories. **Session events** (`SessionStart`, `SessionEnd`, `Setup`, `PreCompact`) handle lifecycle management and context injection. **Tool events** (`PreToolUse`, `PostToolUse`, `PostToolUseFailure`, `PermissionRequest`) are the core interception points — PreToolUse can **approve, deny, or modify** tool calls before execution, while PostToolUse enables post-hoc validation. **Agent control events** (`Stop`, `SubagentStart`, `SubagentStop`, `UserPromptSubmit`, `Notification`) govern multi-agent coordination and user interaction.
|
||||||
|
|
||||||
|
What makes Claude Code hooks uniquely powerful is their **four handler types**. Command hooks run shell scripts receiving event JSON on stdin. HTTP hooks POST event JSON to external services. **Prompt hooks** send context to a fast Claude model (Haiku) for single-turn semantic evaluation — ideal for checking whether a knowledge claim meets quality criteria. **Agent hooks** spawn a subagent with tool access for multi-turn codebase verification — no other AI coding tool offers this capability. The priority system (`deny > ask > allow`) means any hook returning deny blocks the operation regardless of other hooks, enabling layered defense.
|
||||||
|
|
||||||
|
For knowledge base workflows, the critical pattern is combining PreToolUse hooks (block writes to protected knowledge files without proper frontmatter), PostToolUse hooks (run schema validation after every file edit), and Stop hooks (verify all modified claims have required metadata before the agent considers its task complete).
|
||||||
|
|
||||||
|
## The broader ecosystem splits into advisory and deterministic paradigms
|
||||||
|
|
||||||
|
Across competing systems, **two fundamentally different paradigms** emerge for controlling agent behavior, and understanding this distinction is essential for building honest feedback loops.
|
||||||
|
|
||||||
|
**Advisory systems** inject rules into the LLM's context window but cannot enforce compliance. Cursor's `.cursor/rules/*.mdc` files, Windsurf's `.windsurf/rules/*.md` files, and Aider's `CONVENTIONS.md` all operate at this level. They guide behavior through prompt engineering — useful for coding style preferences but insufficient for security-critical validation. A convergence is emerging around **AGENTS.md** as a cross-tool standard (supported by VS Code, Windsurf, Aider, Zed, Warp), but the fundamental limitation remains: advisory rules can be ignored or circumvented.
|
||||||
|
|
||||||
|
**Deterministic systems** intercept execution programmatically and can block actions. CrewAI's `@before_tool_call` / `@after_tool_call` decorators return `False` to block execution. LangChain 1.0's middleware provides `before_model`, `wrap_model_call`, and `after_model` hooks with `HumanInTheLoopMiddleware` for approval gates. AutoGen's `MiddlewareAgent` can short-circuit with direct replies. MCP's approval policies (`require_approval`) and tool annotations flag destructive operations. These systems provide the enforcement layer needed for honest feedback loops.
|
||||||
|
|
||||||
|
For a Forgejo knowledge base, the practical recommendation is to **layer both paradigms**: use AGENTS.md for cross-agent convention sharing (what format claims should follow, what review criteria to apply), while enforcing compliance through deterministic hooks (pre-commit schema validation, Cedar policies blocking malformed commits, CI gates that reject PRs with invalid knowledge files).
|
||||||
|
|
||||||
|
## Forgejo provides every primitive for multi-agent PR orchestration
|
||||||
|
|
||||||
|
Forgejo's API and Actions system supports three architectural patterns for multi-agent knowledge base workflows, each with different tradeoffs.
|
||||||
|
|
||||||
|
**Pattern A: Webhook-driven agent pipeline.** Configure Forgejo webhooks for `pull_request` events (supporting `opened`, `reopened`, `synchronize`, `labeled`, `edited` types) that POST to an agent server. The agent reads changed files via `GET /repos/{owner}/{repo}/pulls/{index}/files`, validates content via `GET /repos/{owner}/{repo}/contents/{filepath}`, posts structured reviews via `POST /repos/{owner}/{repo}/pulls/{index}/reviews` (with `event: "APPROVED"`, `"COMMENT"`, or `"REQUEST_CHANGES"` and line-level comments), and sets commit status via `POST /repos/{owner}/{repo}/statuses/{sha}`. Webhook payloads are signed with **HMAC-SHA256** and include compatibility headers for GitHub, Gitea, and Gogs. The **5-second delivery timeout** means agent servers must respond quickly and process asynchronously.
|
||||||
|
|
||||||
|
**Pattern B: Forgejo Actions-driven agents.** Workflows in `.forgejo/workflows/*.yaml` use syntax intentionally similar to GitHub Actions. The `pull_request_target` trigger is critical — it runs in the context of the base branch, has **write permissions and access to secrets**, unlike `pull_request` from forks which is read-only. Each workflow run gets an automatic token (available as `${{ secrets.GITHUB_TOKEN }}`) with full read/write to the repository. The **anti-recursion rule** — changes made with this token don't trigger new workflows — prevents infinite loops but requires careful design for multi-step agent pipelines.
|
||||||
|
|
||||||
|
**Pattern C: Hybrid.** Forgejo Actions handles triage (labeling, basic schema validation), while webhooks notify an external orchestrator that dispatches specialized agents (claim validator, source checker, consistency analyzer) in parallel. Each agent posts independent reviews via the API. Branch protection rules require all agent status checks to pass plus human approval before merge. The commit status API (`POST /repos/{owner}/{repo}/statuses/{sha}` with states `pending`, `success`, `error`, `failure`) enables fine-grained quality gates.
|
||||||
|
|
||||||
|
Existing Forgejo integrations already demonstrate feasibility: **forgejo-mcp** provides an MCP server connecting AI assistants to Forgejo's full API, **auditlm** monitors PRs for @mentions and runs isolated AI code review with local LLMs, and **opencode-review-gitea** delivers line-level AI reviews triggered by `/oc` comments.
|
||||||
|
|
||||||
|
## Multi-agent git orchestration has reached production maturity
|
||||||
|
|
||||||
|
The pattern of Agent A proposing via PR and Agent B reviewing has moved from research concept to production system. **SWE-AF** (Agent Field) deploys **400-500+ agent instances** across planning, coding, reviewing, QA, and verification roles, scoring 95/100 on benchmarks versus 73 for single-agent Claude Code. Each agent operates in an isolated git worktree, with a merger agent integrating branches and a verifier agent checking acceptance criteria against the PRD. Critically, SWE-AF implements **continual learning**: conventions and failure patterns discovered early are injected into downstream agent instances.
|
||||||
|
|
||||||
|
**Cisco's multi-agent PR reviewer** demonstrates the specific reviewer architecture: static analysis and code review agents run in parallel, a cross-referencing pipeline (initializer → generator → reflector) iterates on findings, and a comment filterer consolidates before posting. Built on LangGraph, it includes evaluation tooling that replays PR history with "LLM-as-a-judge" scoring.
|
||||||
|
|
||||||
|
For adversarial review specifically, **BugBot** implements the most rigorous pattern: a self-referential execution loop where each iteration gets fresh context, picks new attack angles, and requires **file:line evidence** for every finding. Seven ODC trigger categories must each be tested, and **consensus voting** between independent agents auto-upgrades confidence when two agents flag the same issue.
|
||||||
|
|
||||||
|
The trust and reputation layer is emerging through **Mnemom**, which assigns agent trust ratings on a **0-1000 scale with AAA-through-CCC grades**. Team ratings weight five components: team coherence history (35%), aggregate member quality (25%), operational track record (20%), structural stability (10%), and assessment density (10%). Scores use **Ed25519 signatures and STARK zero-knowledge proofs**, with a GitHub Action (`mnemom/reputation-check@v1`) for CI gating on trust scores. This directly enables the **meta-monitoring** principle: persistent scores with weekly trend snapshots provide predictive guardrails from historical data.
|
||||||
|
|
||||||
|
## Validation tooling covers the full knowledge file lifecycle
|
||||||
|
|
||||||
|
For markdown/YAML knowledge claim files, a complete validation stack combines four layers operating at increasing semantic depth.
|
||||||
|
|
||||||
|
**Syntactic validation** uses the pre-commit framework (`.pre-commit-config.yaml`) orchestrating yamllint for YAML style, `check-yaml` for syntax, markdownlint-cli2 for markdown formatting (53+ configurable rules), and `trailing-whitespace` / `end-of-file-fixer` for hygiene. These run on every commit locally and in CI as a safety net against `--no-verify` bypasses.
|
||||||
|
|
||||||
|
**Schema validation** uses `check-jsonschema` to validate YAML files (including extracted frontmatter) against JSON Schema definitions. A claim schema can enforce required fields (`source`, `confidence`, `date`, `domain`), constrain confidence to 0-1 ranges, restrict domains to controlled vocabularies, and validate date formats. For markdown frontmatter specifically, `remark-lint-frontmatter-schema` validates against JSON Schema using ajv. **Yamale** (by 23andMe) offers an alternative with its own schema DSL and easy custom validator extensibility.
|
||||||
|
|
||||||
|
**Cross-reference validation** requires custom scripting: parse all knowledge files to build a claim ID index, verify that `supersedes`, `related_to`, and `contradicts` references point to existing claims, and validate bidirectional relationships. No off-the-shelf tool handles this for flat files, though SHACL handles it natively for RDF graphs.
|
||||||
|
|
||||||
|
**Semantic validation** through SHACL (W3C Shapes Constraint Language) becomes relevant if claims are converted to RDF. **pySHACL** validates graphs against shape constraints including property paths, cardinality, and transitive relationship chains. The conversion pipeline (parse YAML frontmatter → generate RDF triples → validate with SHACL shapes) adds complexity but enables powerful graph-level consistency checking that JSON Schema cannot achieve — detecting contradictions across claims, enforcing relationship integrity, and supporting RDFS/OWL reasoning before validation.
|
||||||
|
|
||||||
|
A recommended `.pre-commit-config.yaml` combines `pre-commit-hooks` (check-yaml, trailing-whitespace), `markdownlint-cli`, `yamllint`, `check-jsonschema` targeting `^claims/.*\.yaml`, and local hooks for frontmatter validation and cross-reference checking.
|
||||||
|
|
||||||
|
## Mapping the seven principles to concrete tooling
|
||||||
|
|
||||||
|
Each honest feedback loop principle maps directly to available tooling, with specific implementation recommendations for a Forgejo-based knowledge base.
|
||||||
|
|
||||||
|
**Architectural separation** requires that the entity evaluating claims is structurally independent from the entity producing them. Implement this with separate Forgejo bot accounts: a "writer-agent" that proposes claims via PR and a "reviewer-agent" that evaluates them. Sondera's Cedar policy engine enforces this at the infrastructure level — forbid rules prevent the writer-agent from merging its own PRs. Branch protection rules require reviewer-agent approval. The anti-recursion property of Forgejo's automatic token prevents circular triggers.
|
||||||
|
|
||||||
|
**Ground-truth anchoring** means claims must be validated against external reality. Implement PreToolUse hooks that require every claim to include a `source` field with a fetchable URL. A PostToolUse hook can verify source URLs are accessible and content-matches the claim (using an agent hook handler for semantic verification). Schema validation enforces `confidence` scores are present, and CI gates can reject claims with confidence above 0.9 that lack multiple independent sources.
|
||||||
|
|
||||||
|
**Adversarial probing** requires dedicated challenge of claims. Deploy a BugBot-style adversarial review agent triggered by `pull_request_target` events. This agent specifically attempts to find counterevidence, checks for logical inconsistencies, and applies the seven ODC trigger categories to knowledge claims. Consensus voting between independent review agents upgrades confidence in identified issues.
|
||||||
|
|
||||||
|
**Defense in depth** layers multiple independent validation mechanisms. The stack includes: pre-commit schema validation (local), Forgejo Actions CI validation (server-side), YARA signature scanning for known misinformation patterns, Cedar policy evaluation for structural constraints, LLM-based semantic review for content quality, and human approval as the final gate. Each layer catches different failure modes, and the deny-overrides semantics of Cedar ensure that any single layer's rejection blocks the merge.
|
||||||
|
|
||||||
|
**Constrained optimization** bounds what agents can do. Cedar policies constrain agent trajectory length (step-count limits), restrict which files agents can modify (information flow control), and block destructive operations. CrewAI-style tool call hooks can limit which external sources agents consult. JSON Schema constrains the structure of claim files to prevent scope creep. Forgejo branch protection prevents direct pushes to the main knowledge branch.
|
||||||
|
|
||||||
|
**Automatic error detection** runs without human triggering. Pre-commit hooks catch malformed files on every commit. Forgejo Actions run the full validation suite on every PR automatically. Staleness detection scripts flag claims whose `date` field exceeds a configurable age threshold. Cross-reference checking identifies orphaned or broken claim relationships. SHACL validation (if using RDF) detects graph-level inconsistencies automatically.
|
||||||
|
|
||||||
|
**Meta-monitoring** tracks the health of the feedback system itself. Mnemom's trust ratings provide per-agent quality tracking with cryptographic attestation. Audit logs from Sondera's harness record every agent action for trajectory analysis. Forgejo's webhook delivery logs and Actions run history provide infrastructure-level monitoring. A dedicated meta-monitoring agent can periodically analyze the ratio of rejected vs. accepted PRs per agent, detect degrading review quality, and flag when the adversarial probing agent hasn't found issues in suspiciously long periods — which itself may indicate the probing is failing.
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
The tooling landscape for honest multiagent knowledge bases has crossed a critical threshold. Three developments make this practical now rather than theoretical: Sondera's demonstration that **deterministic Cedar/YARA policies below the LLM layer** provide unforgeable guardrails that prompt injection cannot circumvent; Claude Code's **15-event hook lifecycle with four handler types** (especially agent hooks that spawn subagents for multi-turn verification); and Mnemom's **cryptographic agent trust ratings** that enable the meta-monitoring principle through persistent, auditable reputation tracking.
|
||||||
|
|
||||||
|
The key architectural decision is choosing between Forgejo Actions-driven agents (simpler, fully self-hosted, limited by the anti-recursion rule) and webhook-driven external orchestration (more complex, supports true multi-agent pipelines with independent feedback loops). For a knowledge base implementing all seven principles, the hybrid pattern is strongest: Actions handles syntactic/schema validation as fast quality gates, while an external orchestrator dispatches specialized agents (adversarial prober, source verifier, consistency checker) that post independent reviews through the Forgejo API. The convergence around AGENTS.md as a cross-tool standard means these conventions can be shared across Claude Code, Cursor, and custom agents without tool-specific configuration — though the advisory nature of AGENTS.md means it must be paired with deterministic enforcement through hooks and CI gates.
|
||||||
Loading…
Reference in a new issue