teleo-codex/domains/ai-alignment/deterministic policy engines operating below the LLM layer cannot be circumverted by prompt injection making them essential for adversarial-grade AI agent control.md
Alex dba00a7960 Recover alexastrum contributions from GitHub PR #68 (lost during mirror sync)
6 claims + 1 source originally merged Mar 9 via GitHub squash merge.
Forgejo→GitHub mirror overwrote GitHub main, erasing these files.
Recovered from unreachable commit 9bd6c77c before GitHub GC.
Added sourcer: alexastrum attribution to claim frontmatter.
2026-04-16 16:46:26 +00:00


---
type: claim
domain: ai-alignment
description: Sondera's Cedar/YARA reference monitor demonstrates that intercepting agent actions at the execution layer — not the prompt layer — provides guardrails that prompt injection cannot bypass, establishing a fundamental architectural distinction for AI safety infrastructure.
confidence: experimental
source: Alex — based on Compass research artifact analyzing Sondera (sondera-ai/sondera-coding-agent-hooks), Claude Code hooks, and the broader agent control ecosystem (2026-03-08)
sourcer: alexastrum
created: 2026-03-08
---

Deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection, making them essential for adversarial-grade AI agent control

Two fundamentally different paradigms exist for controlling AI agent behavior, and understanding this distinction is essential for building trustworthy multi-agent systems.

Advisory systems inject rules into the LLM's context window but cannot enforce compliance. Cursor's .cursor/rules/*.mdc files, Windsurf's .windsurf/rules/*.md files, Aider's CONVENTIONS.md, and the emerging AGENTS.md cross-tool standard all operate at this level. They guide behavior through prompt engineering — useful for coding style preferences but insufficient for security-critical validation. The fundamental limitation: advisory rules can be ignored or circumvented by prompt injection, model drift, or context window overflow.
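A hypothetical AGENTS.md excerpt makes the advisory level concrete: the rules below reach the model only as context-window text, so nothing in the toolchain enforces them.

```markdown
# AGENTS.md (advisory: the model reads this, but nothing enforces it)

## Conventions
- Run the test suite before committing.
- Never write files outside `src/`.
  <!-- An injected prompt can instruct the model to ignore this line,
       and no mechanism exists to detect or block the violation. -->
```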

Deterministic systems intercept execution programmatically and can block actions regardless of what the LLM intended. Sondera's reference monitor (released at Unprompted 2026) demonstrates the strongest form: a Rust-based harness using YARA-X signatures for pattern matching and Amazon's Cedar policy language for access control, intercepting every shell command, file operation, and web request made by Claude Code, Cursor, GitHub Copilot, and Gemini CLI. A single matching Cedar forbid overrides any permit — the deny-overrides semantics ensure that no prompt injection can authorize a blocked action.
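A minimal Cedar policy pair illustrates the deny-overrides semantics. The entity types, action names, and attribute here are illustrative, not Sondera's actual schema:

```cedar
// Broad default: the agent may run shell commands.
permit(
  principal == Agent::"coding-agent",
  action == Action::"shell_exec",
  resource
);

// A single matching forbid wins over any permit (deny-overrides),
// so no prompt content can re-authorize a blocked command.
forbid(
  principal,
  action == Action::"shell_exec",
  resource
) when {
  resource.command like "*curl*"
};
```

Because the authorizer combines decisions with deny-overrides rather than precedence or ordering, adding permits can never widen access past an existing forbid.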

The architectural point is structural, not about any particular tool. When the enforcement mechanism operates below the LLM — intercepting tool calls, file writes, and shell commands at the execution boundary — the LLM cannot reason its way past the constraint. This is the same principle that makes OS-level permissions more reliable than application-level access checks: the enforcement point is outside the entity being constrained.
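The structural point can be sketched in a few lines of Python (all names and patterns hypothetical): the policy check wraps the execution call itself, so the decision happens outside anything the LLM can influence, no matter what text the model produced.

```python
from dataclasses import dataclass
import subprocess

@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str

# Deny-overrides in miniature: any matching forbid pattern wins.
FORBID_PATTERNS = ["rm -rf", "curl | sh", ".ssh/id_"]

def evaluate(command: str) -> Decision:
    for pattern in FORBID_PATTERNS:
        if pattern in command:
            return Decision(False, f"forbidden pattern: {pattern!r}")
    return Decision(True, "no forbid matched")

def guarded_exec(command: str) -> str:
    """Enforcement point below the LLM: this check runs before, and
    independently of, whatever the model 'intended' or was injected with."""
    decision = evaluate(command)
    if not decision.allowed:
        raise PermissionError(decision.reason)
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout
```

Even if a prompt injection convinces the model to emit a destructive command, `guarded_exec` raises before anything runs; the model has no code path that bypasses `evaluate`.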

Additional deterministic systems confirm the pattern: CrewAI's @before_tool_call / @after_tool_call decorators return False to block execution; LangChain 1.0's middleware provides before_model, wrap_model_call, and after_model hooks; AutoGen's MiddlewareAgent can short-circuit with direct replies; MCP's approval policies flag destructive operations.
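The hook pattern these frameworks share can be sketched generically; the decorator, check function, and tool names below are illustrative, not any framework's actual API. A pre-call hook that returns False short-circuits the tool invocation:

```python
from functools import wraps
from typing import Callable

def before_tool_call(check: Callable[[str, dict], bool]):
    """Wrap a tool dispatcher so `check` runs first; False blocks the call."""
    def decorator(tool):
        @wraps(tool)
        def wrapper(name: str, args: dict):
            if not check(name, args):
                # Short-circuit: the underlying tool is never invoked.
                return {"blocked": True, "tool": name}
            return tool(name, args)
        return wrapper
    return decorator

def deny_writes_outside_workspace(name: str, args: dict) -> bool:
    if name == "write_file":
        return args.get("path", "").startswith("/workspace/")
    return True

@before_tool_call(deny_writes_outside_workspace)
def run_tool(name: str, args: dict):
    # Stand-in for the real tool dispatch an agent framework would perform.
    return {"blocked": False, "tool": name}
```

The key property is the same as in the Cedar case: the check lives in the dispatch path, not in the prompt, so the model cannot talk its way around it.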

The practical recommendation for any multi-agent knowledge system is to layer both paradigms: use advisory rules (AGENTS.md, CLAUDE.md) for convention sharing, while enforcing compliance through deterministic hooks, Cedar policies, and CI gates that cannot be bypassed by the agents they constrain.


Relevant Notes:

Topics: