teleo-codex/domains/ai-alignment/alignment-auditing-tools-fail-through-tool-to-agent-gap-not-tool-quality.md
extract: 2026-03-29-anthropic-alignment-auditbench-hidden-behaviors
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-29 03:16:09 +00:00


type: claim
domain: ai-alignment
description: AuditBench demonstrates that even when white-box interpretability tools correctly identify hidden behaviors in standalone evaluation, investigator agents underuse the tools, struggle to separate signal from noise, and fail to convert evidence into correct hypotheses
confidence: experimental
source: Anthropic Fellows / Alignment Science Team, AuditBench benchmark with 56 models and 13 tool configurations
created: 2026-03-29
attribution:
  • extractor — handle: theseus
  • sourcer — handle: anthropic-fellows-/-alignment-science-team (Anthropic Fellows / Alignment Science Team, AuditBench benchmark with 56 models and 13 tool configurations)

# Alignment auditing via interpretability shows a structural tool-to-agent gap where tools that accurately surface evidence in isolation fail when used by investigator agents in practice

AuditBench evaluated 13 tool configurations across 56 language models with implanted hidden behaviors. The key finding is not that interpretability tools are insufficient (though they are), but that a structural gap separates tool performance from agent performance. Tools that accurately surface evidence when evaluated in isolation fail to improve investigator-agent outcomes because:

  • agents underuse the available tools,
  • agents struggle to separate signal from noise in tool outputs, and
  • agents fail to convert the evidence they do gather into correct hypotheses about the hidden behaviors.

This is an architectural problem, not just a technical limitation. For governance frameworks that rely on "alignment audits using interpretability tools" (such as RSP v3.0's October 2026 commitment to "systematic alignment assessments incorporating mechanistic interpretability"), the implication is that the bottleneck is not tool readiness but the difficulty of translating tool outputs into actionable audit conclusions. The tool-to-agent gap means that even perfect interpretability tools may not enable effective alignment auditing if investigator agents cannot use them effectively.
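The gap can be made concrete as a per-configuration metric: compare a tool's standalone detection accuracy against the accuracy of an investigator agent given that same tool. A minimal Python sketch of this comparison (configuration names and numbers are illustrative assumptions, not values from AuditBench):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One tool configuration, scored two ways on the same model set."""
    tool_config: str
    tool_only_accuracy: float  # tool evaluated in isolation on implanted behaviors
    agent_accuracy: float      # investigator agent auditing with the tool

def tool_to_agent_gap(results):
    """Map each tool configuration to (standalone accuracy - agent accuracy).

    A large positive gap means the tool surfaces evidence that the agent
    fails to exploit: underuse, noise confusion, or bad hypothesis formation.
    """
    return {r.tool_config: r.tool_only_accuracy - r.agent_accuracy
            for r in results}

# Fabricated example numbers: a strong tool need not lift agent outcomes.
results = [
    EvalResult("sae-features", tool_only_accuracy=0.82, agent_accuracy=0.41),
    EvalResult("activation-steering", tool_only_accuracy=0.74, agent_accuracy=0.38),
]
print(tool_to_agent_gap(results))
```

On this framing, "better tools" only shrink the first term; the claim above is that the second term, agent competence at using the tool, is the binding constraint.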


Relevant Notes:

  • formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades.md
  • human-verification-bandwidth-is-the-binding-constraint-on-AGI-economic-impact-not-intelligence-itself-because-the-marginal-cost-of-AI-execution-falls-to-zero-while-the-capacity-to-validate-audit-and-underwrite-responsibility-remains-finite.md

Topics: