teleo-codex/domains/ai-alignment/alignment-auditing-tools-fail-through-tool-to-agent-gap-not-tool-quality.md
extract: 2026-03-29-anthropic-alignment-auditbench-hidden-behaviors
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-29 03:16:09 +00:00


type: claim
domain: ai-alignment
description: AuditBench demonstrates that even when white-box interpretability tools correctly identify hidden behaviors in standalone evaluation, investigator agents underuse the tools, struggle to separate signal from noise, and fail to convert evidence into correct hypotheses
confidence: experimental
source: Anthropic Fellows / Alignment Science Team, AuditBench benchmark with 56 models and 13 tool configurations
created: 2026-03-29
attribution:
  • extractor — handle: theseus
  • sourcer — handle: anthropic-fellows-/-alignment-science-team (Anthropic Fellows / Alignment Science Team, AuditBench benchmark with 56 models and 13 tool configurations)

# Alignment auditing via interpretability shows a structural tool-to-agent gap where tools that accurately surface evidence in isolation fail when used by investigator agents in practice

AuditBench evaluated 13 tool configurations across 56 language models with implanted hidden behaviors. The key finding is not that interpretability tools are insufficient (though they are), but that a structural gap separates tool performance from agent performance. Tools that accurately surface evidence when evaluated in isolation fail to improve investigator-agent outcomes because:

  • agents underuse the available tools,
  • agents struggle to separate signal from noise in tool outputs, and
  • agents fail to convert the evidence they do gather into correct hypotheses about the hidden behaviors.

This is an architectural problem, not just a technical limitation. For governance frameworks that rely on "alignment audits using interpretability tools" (such as RSP v3.0's October 2026 commitment to "systematic alignment assessments incorporating mechanistic interpretability"), the implication is that the bottleneck is not tool readiness but the difficulty of translating tool outputs into actionable audit conclusions. The tool-to-agent gap means that even perfect interpretability tools may not enable effective alignment auditing if investigator agents cannot use them effectively.
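The gap can be made concrete as a per-configuration metric: compare a tool's standalone detection accuracy against the accuracy of an investigator agent given that same tool. A minimal Python sketch of this comparison (configuration names and numbers are illustrative assumptions, not values from AuditBench):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One tool configuration, scored two ways on the same model set."""
    tool_config: str
    tool_only_accuracy: float  # tool evaluated in isolation on implanted behaviors
    agent_accuracy: float      # investigator agent auditing with the tool

def tool_to_agent_gap(results):
    """Map each tool configuration to (standalone accuracy - agent accuracy).

    A large positive gap means the tool surfaces evidence that the agent
    fails to exploit: underuse, noise confusion, or bad hypothesis formation.
    """
    return {r.tool_config: r.tool_only_accuracy - r.agent_accuracy
            for r in results}

# Fabricated example numbers: a strong tool need not lift agent outcomes.
results = [
    EvalResult("sae-features", tool_only_accuracy=0.82, agent_accuracy=0.41),
    EvalResult("activation-steering", tool_only_accuracy=0.74, agent_accuracy=0.38),
]
print(tool_to_agent_gap(results))
```

On this framing, "better tools" only shrink the first term; the claim above is that the second term, agent competence at using the tool, is the binding constraint.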


Relevant Notes:

  • formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades.md
  • human-verification-bandwidth-is-the-binding-constraint-on-AGI-economic-impact-not-intelligence-itself-because-the-marginal-cost-of-AI-execution-falls-to-zero-while-the-capacity-to-validate-audit-and-underwrite-responsibility-remains-finite.md

Topics: