| claim |
ai-alignment |
AuditBench found that black-box scaffolding (using one model to systematically probe another) was the most effective alignment auditing technique overall, suggesting that behavioral probing may be more tractable than mechanistic understanding |
experimental |
Anthropic Fellows / Alignment Science Team, AuditBench comparative evaluation of 13 tool configurations |
2026-03-29 |
| extractor |
sourcer |
|
|
| handle |
context |
| anthropic-fellows-/-alignment-science-team |
Anthropic Fellows / Alignment Science Team, AuditBench comparative evaluation of 13 tool configurations |
|
|
|
| alignment-auditing-tools-fail-through-tool-to-agent-gap-not-tool-quality |
| Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios |
| Activation steering fails for capability elicitation despite interpretability research suggesting otherwise |
|
| alignment-auditing-tools-fail-through-tool-to-agent-gap-not-tool-quality|related|2026-03-31 |
| interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment|challenges|2026-03-31 |
| white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model|challenges|2026-03-31 |
| Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17 |
| Activation steering fails for capability elicitation despite interpretability research suggesting otherwise|related|2026-04-21 |
|
| interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment |
| white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model |
|