m3taversal f63eb8000a fix: normalize 1,072 broken wiki-links across 604 files

Mechanical space→hyphen conversion in frontmatter references
(related_claims, challenges, supports, etc.) to match actual
filenames. Fixes 26.9% broken link rate found by wiki-link audit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-21 10:21:26 +01:00

3.7 KiB

Raw Blame History

type

domain

description

confidence

source

created

attribution

supports

reweave_edges

claim

ai-alignment

AuditBench demonstrates that white-box interpretability tools work on easier targets but the gap between tool performance in isolation and agent performance suggests task complexity, not capability, is the bottleneck

experimental

Abhay Sheshadri et al., Anthropic Fellows Program, AuditBench benchmark with 56 models across 13 tool configurations

2026-03-30

extractor

sourcer

handle
theseus

handle	context
anthropic-fellows-program	Abhay Sheshadri et al., Anthropic Fellows Program, AuditBench benchmark with 56 models across 13 tool configurations

adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing

agent-mediated-correction-proposes-closing-tool-to-agent-gap-through-domain-expert-actionability

adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing|supports|2026-04-03

agent-mediated-correction-proposes-closing-tool-to-agent-gap-through-domain-expert-actionability|supports|2026-04-03

capability-scaling-increases-error-incoherence-on-difficult-tasks-inverting-the-expected-relationship-between-model-size-and-behavioral-predictability|related|2026-04-03

frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase|related|2026-04-03

capability-scaling-increases-error-incoherence-on-difficult-tasks-inverting-the-expected-relationship-between-model-size-and-behavioral-predictability

frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase

Alignment auditing shows a structural tool-to-agent gap where interpretability tools that accurately surface evidence in isolation fail when used by investigator agents because agents underuse tools, struggle to separate signal from noise, and fail to convert evidence into correct hypotheses

AuditBench evaluated 56 LLMs with implanted hidden behaviors using investigator agents with access to configurable tool sets across 13 different configurations. The key finding is a structural tool-to-agent gap: tools that surface accurate evidence when used in isolation fail to improve agent performance in practice. Investigator agents systematically underuse available tools, struggle to separate signal from noise in tool outputs, and fail to convert evidence into correct hypotheses about hidden behaviors. This gap persists even when the tools themselves are working correctly—white-box interpretability tools help on easier targets (lightly adversarially trained models) but fail on models with more robust adversarial training. The paper explicitly states this is a task complexity issue, not an engineering problem, suggesting the gap is fundamental rather than fixable through better prompting or agent architecture. This matters because Anthropic's RSP v3 (released the same month, February 2026) commits to 'systematic alignment assessments incorporating mechanistic interpretability'—but this research from Anthropic-funded fellows shows that interpretability tools may only catch naive misalignment, not sophisticated embedded misalignment, and even when they work, agents can't use them effectively.

Relevant Notes:

scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps
AI capability and reliability are independent dimensions because Claude solved a 30 year open mathematical problem while simultaneously degrading at basic program execution during the same session

Topics:

_map

3.7 KiB Raw Blame History

Alignment auditing shows a structural tool-to-agent gap where interpretability tools that accurately surface evidence in isolation fail when used by investigator agents because agents underuse tools, struggle to separate signal from noise, and fail to convert evidence into correct hypotheses

3.7 KiB

Raw Blame History