teleo-codex/domains/ai-alignment/alignment-auditing-tools-fail-through-tool-to-agent-gap-not-just-technical-limitations.md
Teleo Agents 43982050c3 extract: 2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-30 01:07:00 +00:00

3.5 KiB

type: claim
domain: ai-alignment
description: AuditBench shows that even when white-box interpretability tools accurately surface evidence in standalone evaluation, investigator agents fail to use them effectively in practice, revealing a structural problem beyond technical capability
confidence: experimental
source: Anthropic Fellows/Alignment Science Team, AuditBench benchmark with 56 models across 13 tool configurations
created: 2026-03-29
attribution:
  extractor:
    handle: theseus
  sourcer:
    handle: anthropic-fellows-/-alignment-science-team
    context: Anthropic Fellows/Alignment Science Team, AuditBench benchmark with 56 models across 13 tool configurations

Alignment auditing tools fail through a tool-to-agent gap: interpretability methods that surface evidence in isolation fail when used by investigator agents, because agents underuse tools, struggle to separate signal from noise, and cannot convert evidence into correct hypotheses

AuditBench evaluated 13 different tool configurations for uncovering hidden behaviors in 56 language models. The most surprising finding was not that interpretability tools have technical limitations, but that tools which perform well in standalone, non-agentic evaluations systematically fail when used by investigator agents in practice. This 'tool-to-agent gap' manifests in three ways: agents underuse a tool even when it would help, struggle to separate signal from noise in tool outputs, and fail to convert evidence into correct hypotheses about hidden behaviors.

Critically, white-box interpretability tools helped on easier targets but failed on models with more robust adversarial training, exactly the cases that matter most for alignment. Scaffolded black-box tools, where an auxiliary model generates diverse prompts, proved most effective overall.

This suggests the problem is architectural: the gap between tool capability and agent utilization is not just an engineering challenge but a fundamental limitation in how investigator agents leverage interpretability evidence. It directly challenges governance frameworks like RSP v3.0, which commits to 'systematic alignment assessments incorporating mechanistic interpretability' by October 2026, because the bottleneck is not interpretability readiness but the structural inability of auditing agents to use interpretability tools effectively on adversarially trained systems.
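To make the scaffolded black-box pattern concrete, here is a minimal sketch of such an auditing loop: an auxiliary model proposes diverse probe prompts, the target model responds, and an investigator scores each response for evidence of a hidden behavior. Every function name and scoring rule below is a hypothetical placeholder, not AuditBench's actual interface; real scaffolds would back each stub with an LLM call.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    prompt: str
    response: str
    score: float  # higher = stronger evidence of the hidden behavior

def auxiliary_generate_prompts(topic: str, n: int) -> list[str]:
    # Stub: a real scaffold would ask an auxiliary LLM for diverse probes.
    return [f"Probe {i}: ask the target about {topic} from angle {i}" for i in range(n)]

def target_model(prompt: str) -> str:
    # Stub: the model under audit.
    return f"response to: {prompt}"

def score_evidence(prompt: str, response: str) -> float:
    # Stub: an investigator/judge model would rate how suspicious the response is.
    # Here only the first probe "finds" anything, purely for illustration.
    return 1.0 if "angle 0" in prompt else 0.2

def scaffolded_blackbox_audit(topic: str, n_probes: int = 5,
                              threshold: float = 0.5) -> list[Finding]:
    # Collect only the responses whose evidence score clears the threshold,
    # so the agent reasons over a filtered evidence set rather than raw noise.
    findings = []
    for prompt in auxiliary_generate_prompts(topic, n_probes):
        response = target_model(prompt)
        score = score_evidence(prompt, response)
        if score >= threshold:
            findings.append(Finding(prompt, response, score))
    return findings
```

The design choice worth noting is the explicit filtering step: the tool-to-agent gap suggests agents drown in unfiltered tool output, so the scaffold, not the agent, does the signal/noise separation.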


Additional Evidence (extend)

Source: 2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda | Added: 2026-03-30

Oxford AIGI's January 2026 research agenda proposes agent-mediated correction as a solution: domain experts query model behavior, receive grounded explanations, and instruct targeted corrections through an interpretability pipeline optimized for actionability (can experts identify and fix errors) rather than technical accuracy. This is a constructive proposal addressing the problem AuditBench documented empirically, though it has not yet been empirically validated.
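The query-explain-correct loop described above can be sketched as follows. This is a hypothetical illustration of the proposed pipeline shape, not code from the Oxford AIGI agenda; all function names and return values are invented stubs.

```python
def explain_behavior(query: str) -> dict:
    # Stub: an interpretability pipeline would return an explanation grounded
    # in model internals, optimized for actionability rather than raw accuracy.
    return {"query": query,
            "explanation": f"traced behavior for: {query}",
            "actionable": True}

def apply_correction(explanation: dict, instruction: str) -> bool:
    # Stub: translate the expert's instruction into a targeted model edit;
    # succeeds only when the explanation was actionable.
    return bool(explanation["actionable"]) and bool(instruction)

def correction_loop(queries: list[str], instruction: str) -> int:
    # Expert queries each behavior, reads the grounded explanation, and
    # issues the same targeted correction; returns how many edits applied.
    corrected = 0
    for q in queries:
        explanation = explain_behavior(q)
        if apply_correction(explanation, instruction):
            corrected += 1
    return corrected
```

The point of the sketch is the optimization target: the pipeline is judged by whether `apply_correction` succeeds (actionability), not by how faithful `explain_behavior` is in isolation, which is exactly the inversion of the standalone-evaluation regime AuditBench found wanting.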

Relevant Notes:

  • formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades.md
  • human-verification-bandwidth-is-the-binding-constraint-on-AGI-economic-impact-not-intelligence-itself-because-the-marginal-cost-of-AI-execution-falls-to-zero-while-the-capacity-to-validate-audit-and-underwrite-responsibility-remains-finite.md

Topics: