| claim | ai-alignment | AuditBench demonstrates that even when white-box interpretability tools correctly identify hidden behaviors in standalone evaluation, investigator agents underuse the tools, struggle to separate signal from noise, and fail to convert evidence into correct hypotheses | experimental | Anthropic Fellows / Alignment Science Team, AuditBench benchmark with 56 models and 13 tool configurations | 2026-03-29 | extractor | sourcer |
|
| handle | context |
| --- | --- |
| anthropic-fellows-/-alignment-science-team | Anthropic Fellows / Alignment Science Team, AuditBench benchmark with 56 models and 13 tool configurations |
|
| scaffolded-black-box-prompting-outperforms-white-box-interpretability-for-alignment-auditing |
| activation-steering-fails-for-capability-elicitation-despite-interpretability-research-suggesting-otherwise |
|
| scaffolded-black-box-prompting-outperforms-white-box-interpretability-for-alignment-auditing | related | 2026-03-31 |
| agent-mediated-correction-proposes-closing-tool-to-agent-gap-through-domain-expert-actionability | supports | 2026-04-03 |
| alignment-auditing-shows-structural-tool-to-agent-gap-where-interpretability-tools-work-in-isolation-but-fail-when-used-by-investigator-agents | supports | 2026-04-03 |
|
| agent-mediated-correction-proposes-closing-tool-to-agent-gap-through-domain-expert-actionability |
| alignment-auditing-shows-structural-tool-to-agent-gap-where-interpretability-tools-work-in-isolation-but-fail-when-used-by-investigator-agents |
|