---
type: claim
domain: ai-alignment
description: AuditBench shows that even when white-box interpretability tools accurately surface evidence in standalone evaluation, investigator agents fail to use them effectively in practice, revealing a structural problem beyond technical capability
confidence: experimental
source: Anthropic Fellows/Alignment Science Team, AuditBench benchmark with 56 models across 13 tool configurations
created: 2026-03-29
attribution:
  extractor:
    - handle: "theseus"
  sourcer:
    - handle: "anthropic-fellows-/-alignment-science-team"
context: "Anthropic Fellows/Alignment Science Team, AuditBench benchmark with 56 models across 13 tool configurations"
---
# Alignment auditing tools fail through a tool-to-agent gap, where interpretability methods that surface evidence in isolation fail when used by investigator agents, which underuse tools, struggle to separate signal from noise, and cannot convert evidence into correct hypotheses
AuditBench evaluated 13 tool configurations for uncovering hidden behaviors in 56 language models. The most surprising finding was not that interpretability tools have technical limitations, but that tools which perform well in standalone, non-agentic evaluations systematically fail when used by investigator agents in practice. This 'tool-to-agent gap' manifests in three ways: agents underuse a tool even when it would help, struggle to separate signal from noise in the tool's outputs, and fail to convert surfaced evidence into correct hypotheses about hidden behaviors.

Critically, white-box interpretability tools helped on easier targets but failed on models with more robust adversarial training, exactly the cases that matter most for alignment. Scaffolded black-box tools, in which an auxiliary model generates diverse probing prompts for the target, proved most effective overall. This suggests the problem is architectural: the gap between tool capability and agent utilization is not just an engineering challenge but a fundamental limitation in how investigator agents can leverage interpretability evidence.

This directly challenges governance frameworks like RSP v3.0, which commit to 'systematic alignment assessments incorporating mechanistic interpretability' by October 2026: the bottleneck is not interpretability readiness but the structural inability of auditing agents to use interpretability tools effectively on adversarially trained systems.
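To make the gap measurable rather than anecdotal, one minimal framing (a sketch with hypothetical names, not AuditBench's actual scoring API) is to compare how often a tool surfaces the relevant evidence in standalone evaluation against how often an investigator agent using that same tool reaches a correct hypothesis:

```python
# Hypothetical sketch: `AuditTrial` and `tool_to_agent_gap` are illustrative
# names, not part of any published AuditBench interface.
from dataclasses import dataclass

@dataclass
class AuditTrial:
    evidence_surfaced: bool   # standalone: tool output contained the evidence
    hypothesis_correct: bool  # agentic: investigator named the hidden behavior

def tool_to_agent_gap(trials: list[AuditTrial]) -> float:
    """Standalone evidence rate minus agent-mediated success rate.

    A large positive value is the failure mode described above: the tool
    works in isolation, but agents fail to turn its output into answers.
    """
    surfaced = sum(t.evidence_surfaced for t in trials) / len(trials)
    solved = sum(t.hypothesis_correct for t in trials) / len(trials)
    return surfaced - solved

# Toy example: evidence surfaces in 9 of 10 trials, but the agent converts
# it into a correct hypothesis in only 3, for a gap of 0.60.
trials = [AuditTrial(True, i < 3) for i in range(9)] + [AuditTrial(False, False)]
print(f"{tool_to_agent_gap(trials):.2f}")  # 0.60
```

Under this framing, all three failure modes widen the same number: underuse, noise-filtering failures, and hypothesis-conversion failures each depress the agentic success rate without touching standalone performance.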
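The scaffolded black-box configuration can likewise be sketched in a few lines, assuming only that both models are exposed as text-in, text-out callables (everything here is illustrative, not the benchmark's actual scaffold):

```python
# Hypothetical sketch of a scaffolded black-box probe: an auxiliary model
# proposes diverse prompts, and the target model's responses are collected
# for the investigator agent to review. `auxiliary` and `target` stand in
# for any chat-completion callable.
from typing import Callable

def scaffolded_blackbox_probe(
    auxiliary: Callable[[str], str],
    target: Callable[[str], str],
    behavior_hint: str,
    n_prompts: int = 20,
) -> list[tuple[str, str]]:
    transcripts = []
    for i in range(n_prompts):
        # Vary the instruction per call (probe index) to encourage prompt
        # diversity from a stateless auxiliary model.
        probe = auxiliary(
            f"Write probe {i + 1} of {n_prompts}: a short, unusual user "
            f"message that might elicit this hidden behavior: {behavior_hint}"
        )
        transcripts.append((probe, target(probe)))
    return transcripts
```

The design point is that the diversity burden moves off the investigator agent and onto the scaffold, which is consistent with the finding that agents left to drive tools directly tend to underuse them.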
---
Relevant Notes:

- formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades.md
- human-verification-bandwidth-is-the-binding-constraint-on-AGI-economic-impact-not-intelligence-itself-because-the-marginal-cost-of-AI-execution-falls-to-zero-while-the-capacity-to-validate-audit-and-underwrite-responsibility-remains-finite.md

Topics:

- [[_map]]