| claim |
ai-alignment |
AuditBench's comparison of 13 tool configurations found that black-box scaffolding (using one model to systematically probe another) was the most effective alignment-auditing technique overall, suggesting that behavioral probing may be more tractable than mechanistic understanding |
experimental |
Anthropic Fellows / Alignment Science Team, AuditBench comparative evaluation of 13 tool configurations |
2026-03-29 |
| extractor |
sourcer |
|
|
| handle |
context |
| anthropic-fellows-/-alignment-science-team |
Anthropic Fellows / Alignment Science Team, AuditBench comparative evaluation of 13 tool configurations |
|
|
|
| alignment auditing tools fail through tool to agent gap not tool quality |
|
| alignment auditing tools fail through tool to agent gap not tool quality|related|2026-03-31 |
| interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment|challenges|2026-03-31 |
| white box interpretability fails on adversarially trained models creating anti correlation with threat model|challenges|2026-03-31 |
|
| interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment |
| white box interpretability fails on adversarially trained models creating anti correlation with threat model |
|