| claim |
ai-alignment |
AuditBench demonstrates that white-box interpretability tools succeed on easier targets, but the gap between tool performance in isolation and agent performance suggests that task complexity, not capability, is the bottleneck |
experimental |
Abhay Sheshadri et al., Anthropic Fellows Program, AuditBench benchmark with 56 models across 13 tool configurations |
2026-03-30 |
| extractor |
sourcer |
|
|
| handle |
context |
| anthropic-fellows-program |
Abhay Sheshadri et al., Anthropic Fellows Program, AuditBench benchmark with 56 models across 13 tool configurations |
|
|
|
| adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing |
| agent mediated correction proposes closing tool to agent gap through domain expert actionability |
|
| adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing|supports|2026-04-03 |
| agent mediated correction proposes closing tool to agent gap through domain expert actionability|supports|2026-04-03 |
| capability scaling increases error incoherence on difficult tasks inverting the expected relationship between model size and behavioral predictability|related|2026-04-03 |
| frontier ai failures shift from systematic bias to incoherent variance as task complexity and reasoning length increase|related|2026-04-03 |
|
| capability scaling increases error incoherence on difficult tasks inverting the expected relationship between model size and behavioral predictability |
| frontier ai failures shift from systematic bias to incoherent variance as task complexity and reasoning length increase |
|