| claim | ai-alignment | AuditBench shows that interpretability tools that improve detection of weakly trained hidden behaviors actively hurt performance on adversarially trained models, meaning the cases that matter most for safety are exactly where interpretability fails | experimental | Anthropic Fellows / Alignment Science Team, AuditBench evaluation across models with varying adversarial training strength | 2026-03-29 |
| extractor | sourcer |

| handle | context |
| anthropic-fellows-/-alignment-science-team | Anthropic Fellows / Alignment Science Team, AuditBench evaluation across models with varying adversarial training strength |

| alignment auditing tools fail through tool to agent gap not tool quality |
| scaffolded black box prompting outperforms white box interpretability for alignment auditing |

| alignment auditing tools fail through tool to agent gap not tool quality | related | 2026-03-31 |
| interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment | supports | 2026-03-31 |
| scaffolded black box prompting outperforms white box interpretability for alignment auditing | related | 2026-03-31 |
| adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing | supports | 2026-04-03 |

| interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment |
| adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing |