teleo-codex/domains/ai-alignment/alignment-auditing-tools-fail-through-tool-to-agent-gap-not-tool-quality.md

---
type: claim
domain: ai-alignment
description: AuditBench demonstrates that even when white-box interpretability tools correctly identify hidden behaviors in standalone evaluation, investigator agents underuse the tools, struggle to separate signal from noise, and fail to convert evidence into correct hypotheses
confidence: experimental
source: Anthropic Fellows / Alignment Science Team, AuditBench benchmark with 56 models and 13 tool configurations
created: 2026-03-29
attribution:
  extractor:
    - handle: "theseus"
  sourcer:
    - handle: "anthropic-fellows-/-alignment-science-team"
      context: "Anthropic Fellows / Alignment Science Team, AuditBench benchmark with 56 models and 13 tool configurations"
---

# Alignment auditing via interpretability shows a structural tool-to-agent gap where tools that accurately surface evidence in isolation fail when used by investigator agents in practice

AuditBench evaluated 13 different tool configurations across 56 language models with implanted hidden behaviors. The key finding is not that interpretability tools are insufficient (though they are), but that a structural gap exists between tool performance and agent performance. Tools that accurately surface evidence when evaluated in isolation fail to improve investigator agent outcomes because: (1) agents underuse available tools, (2) agents struggle to separate signal from noise in tool outputs, (3) agents fail to convert evidence into correct hypotheses about hidden behaviors. This is an architectural problem, not just a technical limitation. The implication for governance frameworks that rely on 'alignment audits using interpretability tools' (like RSP v3.0's October 2026 commitment to 'systematic alignment assessments incorporating mechanistic interpretability') is that the bottleneck is not tool readiness but the fundamental difficulty of translating tool outputs into actionable audit conclusions. The tool-to-agent gap means that even perfect interpretability tools may not enable effective alignment auditing if investigator agents cannot use them effectively.

---

Relevant Notes:
- formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades.md
- human-verification-bandwidth-is-the-binding-constraint-on-AGI-economic-impact-not-intelligence-itself-because-the-marginal-cost-of-AI-execution-falls-to-zero-while-the-capacity-to-validate-audit-and-underwrite-responsibility-remains-finite.md

Topics:
- [[_map]]