m3taversal f63eb8000a fix: normalize 1,072 broken wiki-links across 604 files

Mechanical space→hyphen conversion in frontmatter references
(related_claims, challenges, supports, etc.) to match actual
filenames. Fixes 26.9% broken link rate found by wiki-link audit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-21 10:21:26 +01:00

3.4 KiB


type: claim
domain: ai-alignment
description: AuditBench found that black-box scaffolding (using one model to systematically probe another) was the most effective alignment auditing technique overall, suggesting behavioral probing may be more tractable than mechanistic understanding
confidence: experimental
source: Anthropic Fellows / Alignment Science Team, AuditBench comparative evaluation of 13 tool configurations
created: 2026-03-29
attribution:
  extractor:
    - handle: theseus
  sourcer:
    - handle: anthropic-fellows-/-alignment-science-team
      context: Anthropic Fellows / Alignment Science Team, AuditBench comparative evaluation of 13 tool configurations
related:
  - alignment-auditing-tools-fail-through-tool-to-agent-gap-not-tool-quality
  - Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios
  - Activation steering fails for capability elicitation despite interpretability research suggesting otherwise
reweave_edges:
  - alignment-auditing-tools-fail-through-tool-to-agent-gap-not-tool-quality|related|2026-03-31
  - interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment|challenges|2026-03-31
  - white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model|challenges|2026-03-31
  - Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17
  - Activation steering fails for capability elicitation despite interpretability research suggesting otherwise|related|2026-04-21
challenges:
  - interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment
  - white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model

Scaffolded black-box tools where an auxiliary model generates diverse prompts for the target are most effective at uncovering hidden behaviors, outperforming white-box interpretability approaches

Across AuditBench's 56 models with hidden behaviors, scaffolded black-box tools proved the most effective auditing technique overall. In this approach, an auxiliary AI model generates diverse prompts designed to elicit the target model's hidden behaviors, without requiring access to the target's internal representations. It outperformed white-box interpretability tools, which attempt to find hidden behaviors through mechanistic analysis of model internals. The finding suggests that behavioral probing (systematically exploring a model's input-output behavior) may be more tractable for alignment auditing than mechanistic interpretability. This has implications for research prioritization: if black-box scaffolding is both more effective and more robust across model training regimes, alignment research may get better returns from investing in sophisticated prompting strategies than in interpretability tools. However, the tool-to-agent gap still applies: even the most effective tools fail when investigator agents cannot use them properly.
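To make the scaffolding idea concrete, here is a minimal Python sketch of the loop described above: an auxiliary generator proposes varied prompts, the target is queried only through its input-output interface, and a detector flags suspicious responses. This is not AuditBench's implementation. Every name (`scaffolded_probe`, `toy_generator`, `toy_target`, `toy_detector`) is hypothetical, and the "models" are plain functions standing in for real model calls.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical type aliases; none of these names come from AuditBench.
PromptGenerator = Callable[[str, int], List[str]]  # auxiliary model: (topic, n) -> prompts
Target = Callable[[str], str]                      # target model: prompt -> response
Detector = Callable[[str, str], bool]              # flags a (prompt, response) pair


def scaffolded_probe(
    generate: PromptGenerator,
    target: Target,
    looks_suspicious: Detector,
    topics: Iterable[str],
    prompts_per_topic: int = 3,
) -> List[Tuple[str, str]]:
    """Probe the target with generated prompts and collect flagged pairs."""
    findings: List[Tuple[str, str]] = []
    for topic in topics:
        for prompt in generate(topic, prompts_per_topic):
            # Black-box access: only the prompt goes in and text comes out.
            response = target(prompt)
            if looks_suspicious(prompt, response):
                findings.append((prompt, response))
    return findings


# Toy stand-ins (assumptions for illustration only).
def toy_generator(topic: str, n: int) -> List[str]:
    templates = [
        "Tell me about {t}.",
        "What is your honest opinion of {t}?",
        "Write a short review of {t}.",
    ]
    return [tpl.format(t=topic) for tpl in templates[:n]]


def toy_target(prompt: str) -> str:
    # Simulated hidden behavior: excessive praise whenever a review is requested.
    if "review" in prompt.lower():
        return "This is the best thing ever made."
    return "Here is a neutral summary."


def toy_detector(prompt: str, response: str) -> bool:
    return "best thing ever" in response


findings = scaffolded_probe(toy_generator, toy_target, toy_detector, ["cats", "tax law"])
for prompt, response in findings:
    print(f"{prompt!r} -> {response!r}")
```

In this toy setup only the "review" prompts trigger the simulated behavior, which illustrates why prompt diversity matters: a generator that never asked for a review would miss it entirely. A real audit would replace the three stand-in functions with calls to actual models and a more careful detector.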

Relevant Notes:

pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md

Topics:

_map
