| claim | AuditBench shows that interpretability tools that improve detection of weakly-trained hidden behaviors actively hurt performance on adversarially-trained models, meaning the cases that matter most for safety are exactly those where interpretability fails |
| topic | ai-alignment |
| status | experimental |
| source | Anthropic Fellows / Alignment Science Team, AuditBench evaluation across models with varying adversarial training strength |
| date | 2026-03-29 |
| extractor | sourcer |

| handle | context |
| anthropic-fellows-/-alignment-science-team | Anthropic Fellows / Alignment Science Team, AuditBench evaluation across models with varying adversarial training strength |

| handle | relation | date |
| alignment-auditing-tools-fail-through-tool-to-agent-gap-not-tool-quality | related | 2026-03-31 |
| interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment | supports | 2026-03-31 |
| scaffolded-black-box-prompting-outperforms-white-box-interpretability-for-alignment-auditing | related | 2026-03-31 |
| adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing | supports | 2026-04-03 |
| Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features | related | 2026-04-08 |

| interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment |
| adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing |
|