| claim |
ai-alignment |
AuditBench demonstrates that white-box interpretability tools work on easier targets, but the gap between tool performance in isolation and agent performance suggests that task complexity, not capability, is the bottleneck |
experimental |
Abhay Sheshadri et al., Anthropic Fellows Program, AuditBench benchmark with 56 models across 13 tool configurations |
2026-03-30 |
| extractor |
sourcer |
|
|
| handle |
context |
| anthropic-fellows-program |
Abhay Sheshadri et al., Anthropic Fellows Program, AuditBench benchmark with 56 models across 13 tool configurations |
|
|
|
| adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing |
| agent-mediated-correction-proposes-closing-tool-to-agent-gap-through-domain-expert-actionability |
|
| adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing|supports|2026-04-03 |
| agent-mediated-correction-proposes-closing-tool-to-agent-gap-through-domain-expert-actionability|supports|2026-04-03 |
| capability-scaling-increases-error-incoherence-on-difficult-tasks-inverting-the-expected-relationship-between-model-size-and-behavioral-predictability|related|2026-04-03 |
| frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase|related|2026-04-03 |
|
| capability-scaling-increases-error-incoherence-on-difficult-tasks-inverting-the-expected-relationship-between-model-size-and-behavioral-predictability |
| frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase |
|