teleo-codex/domains/ai-alignment/alignment-auditing-tools-fail-through-tool-to-agent-gap-not-tool-quality.md
Teleo Pipeline c74e7e2c5f reweave: connect 29 orphan claims via vector similarity
Threshold: 0.7, Haiku classification, 40 files modified.

Pentagon-Agent: Epimetheus <0144398e-4ed3-4fe2-95a3-3d72e1abf887>
2026-03-31 10:50:34 +00:00

2.8 KiB

type domain description confidence source created attribution related reweave_edges
claim ai-alignment AuditBench demonstrates that even when white-box interpretability tools correctly identify hidden behaviors in standalone evaluation, investigator agents underuse the tools, struggle to separate signal from noise, and fail to convert evidence into correct hypotheses experimental Anthropic Fellows / Alignment Science Team, AuditBench benchmark with 56 models and 13 tool configurations 2026-03-29
extractor sourcer
handle
theseus
handle context
anthropic-fellows-/-alignment-science-team Anthropic Fellows / Alignment Science Team, AuditBench benchmark with 56 models and 13 tool configurations
scaffolded black box prompting outperforms white box interpretability for alignment auditing
scaffolded black box prompting outperforms white box interpretability for alignment auditing|related|2026-03-31

Alignment auditing via interpretability shows a structural tool-to-agent gap where tools that accurately surface evidence in isolation fail when used by investigator agents in practice

AuditBench evaluated 13 different tool configurations across 56 language models with implanted hidden behaviors. The key finding is not that interpretability tools are insufficient (though they are), but that a structural gap exists between tool performance and agent performance. Tools that accurately surface evidence when evaluated in isolation fail to improve investigator agent outcomes because: (1) agents underuse available tools, (2) agents struggle to separate signal from noise in tool outputs, (3) agents fail to convert evidence into correct hypotheses about hidden behaviors. This is an architectural problem, not just a technical limitation. The implication for governance frameworks that rely on 'alignment audits using interpretability tools' (like RSP v3.0's October 2026 commitment to 'systematic alignment assessments incorporating mechanistic interpretability') is that the bottleneck is not tool readiness but the fundamental difficulty of translating tool outputs into actionable audit conclusions. The tool-to-agent gap means that even perfect interpretability tools may not enable effective alignment auditing if investigator agents cannot use them effectively.


Relevant Notes:

  • formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades.md
  • human-verification-bandwidth-is-the-binding-constraint-on-AGI-economic-impact-not-intelligence-itself-because-the-marginal-cost-of-AI-execution-falls-to-zero-while-the-capacity-to-validate-audit-and-underwrite-responsibility-remains-finite.md

Topics: