| type | title | author | url | date | domain | secondary_domains | format | status | processed_by | processed_date | priority | tags | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Anthropic Circuit Tracing Release — Production-Scale Interpretability on Claude 3.5 Haiku | Anthropic Interpretability Team | https://transformer-circuits.pub/2025/attribution-graphs/biology.html | 2025-03-01 | ai-alignment | | research-paper | processed | theseus | 2026-04-02 | medium | | anthropic/claude-sonnet-4.5 |
Content
In March 2025, Anthropic published "Circuit Tracing: Revealing Computational Graphs in Language Models" and open-sourced associated tools. The work introduces cross-layer transcoders (CLTs) — a new type of sparse autoencoder that reads from one layer's residual stream but provides output to all subsequent MLP layers.
Technical approach:
- Replaces the model's MLP layers with cross-layer transcoders
- Transcoders re-express neuron activations as sparser, more interpretable "features" — human-understandable concepts
- Attribution graphs show which features influence which other features across the model
- Applied to Claude 3.5 Haiku (Anthropic's lightweight production model, released October 2024)
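The CLT architecture described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: the dimensions, weight initialization, and ReLU nonlinearity are assumptions chosen for clarity.

```python
import numpy as np

class CrossLayerTranscoder:
    """Toy cross-layer transcoder (CLT): reads the residual stream at
    one layer, encodes it into a sparse set of interpretable features,
    and decodes a separate contribution for every subsequent MLP layer.
    All sizes and initializations here are illustrative."""

    def __init__(self, d_model, n_features, n_downstream_layers, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.standard_normal((d_model, n_features)) * 0.02
        # One decoder matrix per downstream MLP layer the CLT writes to.
        self.W_dec = [rng.standard_normal((n_features, d_model)) * 0.02
                      for _ in range(n_downstream_layers)]

    def __call__(self, residual):
        # ReLU keeps most features at zero, giving a sparse code.
        features = np.maximum(residual @ self.W_enc, 0.0)
        # A decoded output for each subsequent layer's MLP.
        return [features @ W for W in self.W_dec]

# A CLT reading at layer 3 of a 12-layer model writes to layers 3..11.
clt = CrossLayerTranscoder(d_model=64, n_features=512, n_downstream_layers=9)
outputs = clt(np.zeros(64))
assert len(outputs) == 9 and outputs[0].shape == (64,)
```

The key structural difference from an ordinary (per-layer) transcoder is the list of decoders: one feature code influences many downstream layers at once.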
Demonstrated results on Claude 3.5 Haiku:
- Two-hop reasoning: Researchers traced how "the capital of the state containing Dallas" → "Texas" → "Austin." They could see and manipulate the internal representation of "Texas" as an intermediate step
- Poetry planning: Before writing each line of poetry, the model identifies potential rhyming words that could appear at the end — planning happens before execution, and this is visible in attribution graphs
- Multi-step reasoning traced end-to-end: From prompt to response, researchers could follow the chain of feature activations
- Language-independent concepts: Abstract concepts represented consistently regardless of language input
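The two-hop Dallas example above can be pictured as a path through an attribution graph. In the sketch below, the feature names and edge weights are invented for illustration; in the real work, graphs are extracted from model activations rather than written by hand.

```python
# Toy attribution graph: nodes are interpretable features, directed
# edges carry attribution weights (how strongly one feature's
# activation explains another's). All names and weights are invented.
edges = {
    "prompt:Dallas": [("feature:Texas", 0.8)],
    "feature:Texas": [("feature:state-capital", 0.6)],
    "feature:state-capital": [("output:Austin", 0.9)],
}

def trace(node, graph, threshold=0.5, path=None):
    """Follow attribution edges above a weight threshold, returning
    every chain of features reachable from the starting node."""
    path = (path or []) + [node]
    children = [t for t, w in graph.get(node, []) if w >= threshold]
    if not children:
        return [path]
    chains = []
    for child in children:
        chains.extend(trace(child, graph, threshold, path))
    return chains

chains = trace("prompt:Dallas", edges)
# Each chain is one candidate reasoning pathway through the model.
assert chains[0] == ["prompt:Dallas", "feature:Texas",
                     "feature:state-capital", "output:Austin"]
```

Seeing the intermediate "Texas" node is what makes the two-hop claim checkable: researchers could both observe it and intervene on it to change the model's answer.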
Open-source release: Anthropic open-sourced the circuit tracing Python library (compatible with any open-weights model) and a frontend on Neuronpedia for exploring attribution graphs.
Dario Amodei's stated goal (April 2025 essay "The Urgency of Interpretability"): "Reliably detect most AI model problems by 2027" — framing interpretability as an "MRI for AI" that can identify deceptive tendencies, power-seeking, and jailbreak vulnerabilities before deployment.
What this doesn't demonstrate:
- Detection of scheming or deceptive alignment (reasoning and planning are demonstrated, but deceptive intention is not)
- Scaling beyond Claude 3.5 Haiku to larger frontier models (Haiku is the smallest production Claude)
- Real-time oversight at deployment speed
- Robustness against adversarially trained models (AuditBench finding shows white-box tools fail on adversarially trained models)
Agent Notes
Why this matters: This is the strongest evidence for genuine technical progress in interpretability — real results at production model scale, not just toy models. The two-hop reasoning trace is the standout: researchers can see and manipulate intermediate representations in a deployed model.
What surprised me: The scale. This is Claude 3.5 Haiku, a deployed production model, not a research toy, and that's meaningful. But so is the limitations gap: Dario's 2027 goal ("reliably detect most model problems") is still a target, not a current capability. The demonstrated results show how the model reasons, not whether it has hidden goals or deceptive tendencies.
What I expected but didn't find: Demonstration on Claude 3.5 Sonnet or larger. Haiku is specifically the lightweight model; the techniques may not scale to larger variants.
KB connections:
- Directly relevant to B4 — genuine technical progress, but not at the scale needed for alignment-relevant oversight
- Contrasts with DeepMind's negative SAE results: Anthropic's results are positive, DeepMind's are negative. Different approaches (circuit tracing vs. SAEs for harmful intent detection) — but both are under the "mechanistic interpretability" umbrella. This tension is worth noting.
- The Anthropic "MRI for AI" framing is optimistic future projection; current demonstrated capability is more limited
Extraction hints:
- CLAIM: "Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing — there is a gap between demonstrated interpretability capability (how it reasons) and alignment-relevant verification capability (whether it has deceptive goals)"
- Possible divergence candidate: Anthropic's ambitious reverse-engineering approach (circuit tracing, goal: detect most problems by 2027) vs. DeepMind's pragmatic pivot (use what works, SAEs fail on harmful intent) — competing interpretability strategies
Context: Published in Anthropic's Transformer Circuits research series. Received wide attention and is part of why MIT Technology Review named mechanistic interpretability a "2026 Breakthrough Technology." The open-source release is intended to enable external researchers to apply the tools.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: Verification degrades faster than capability grows (B4) — this is the strongest counter-evidence, showing real progress at production scale
WHY ARCHIVED: Most concrete positive evidence for interpretability progress; important to present against the negative findings (DeepMind SAE, scaling limits) to show the full picture
EXTRACTION HINT: Extractor should note the specific gap: demonstrated capability (tracing reasoning) vs. needed capability (detecting deceptive goals) — this distinction is what prevents the Anthropic results from weakening B4