| claim | ai-alignment | AuditBench demonstrates that interpretability tool effectiveness varies dramatically across training configurations, with tools becoming counterproductive on the hardest cases | experimental | Anthropic Fellows/Alignment Science Team, AuditBench evaluation across 56 models with varying adversarial training | 2026-03-29 |
| extractor | sourcer |
|
|
| handle | context |
| anthropic-fellows-/-alignment-science-team | Anthropic Fellows/Alignment Science Team, AuditBench evaluation across 56 models with varying adversarial training |
|
|
|
| white box interpretability fails on adversarially trained models creating anti correlation with threat model |
| adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing |
|
| white box interpretability fails on adversarially trained models creating anti correlation with threat model | supports | 2026-03-31 |
| adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing | supports | 2026-04-03 |
| alignment auditing shows structural tool to agent gap where interpretability tools work in isolation but fail when used by investigator agents | related | 2026-04-03 |
|
| alignment auditing shows structural tool to agent gap where interpretability tools work in isolation but fail when used by investigator agents |
|