teleo-codex/domains/ai-alignment/scaffolded-black-box-prompting-outperforms-white-box-interpretability-for-alignment-auditing.md
Teleo Agents 8b4463d697
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
fix: normalize YAML list indentation across 241 claim files
Previous reweave runs used 2-space indent + quotes for list entries
while the standard format is 0-space indent without quotes. This caused
YAML parse failures during merge. Bulk-fixed all reweave_edges files.

Pentagon-Agent: Ship <D53BE6DB-B498-4B30-B588-75D1F6D2124A>
2026-04-07 00:44:26 +00:00

2.8 KiB

type domain description confidence source created attribution related reweave_edges challenges
claim ai-alignment AuditBench found that black-box scaffolding (using one model to systematically probe another) was the most effective alignment auditing technique overall, suggesting behavioral probing may be more tractable than mechanistic understanding experimental Anthropic Fellows / Alignment Science Team, AuditBench comparative evaluation of 13 tool configurations 2026-03-29
extractor sourcer
handle
theseus
handle context
anthropic-fellows-/-alignment-science-team Anthropic Fellows / Alignment Science Team, AuditBench comparative evaluation of 13 tool configurations
alignment auditing tools fail through tool to agent gap not tool quality
alignment auditing tools fail through tool to agent gap not tool quality|related|2026-03-31
interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment|challenges|2026-03-31
white box interpretability fails on adversarially trained models creating anti correlation with threat model|challenges|2026-03-31
interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment
white box interpretability fails on adversarially trained models creating anti correlation with threat model

Scaffolded black-box tools where an auxiliary model generates diverse prompts for the target are most effective at uncovering hidden behaviors, outperforming white-box interpretability approaches

Across AuditBench's 56 models with hidden behaviors, scaffolded black-box tools proved most effective overall. In this approach, an auxiliary AI model generates diverse prompts designed to elicit the target model's hidden behaviors, without requiring access to the target's internal representations. This outperformed white-box interpretability tools that attempted to understand hidden behaviors through mechanistic analysis of model internals. The finding suggests that behavioral probing — systematically exploring a model's input-output behavior — may be more tractable for alignment auditing than mechanistic interpretability. This has implications for research prioritization: if black-box scaffolding is more effective and more robust across different model training regimes, alignment research may get better returns from investing in sophisticated prompting strategies than in interpretability tools. However, the tool-to-agent gap still applies — even the most effective tools fail when investigator agents cannot use them properly.


Relevant Notes:

  • pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md

Topics: