teleo-codex/inbox/archive/ai-alignment/2026-04-02-anthropic-circuit-tracing-claude-haiku-production-results.md
2026-04-02 10:33:28 +00:00

5.8 KiB

---
type: source
title: "Anthropic Circuit Tracing Release — Production-Scale Interpretability on Claude 3.5 Haiku"
author: Anthropic Interpretability Team
url: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
date: 2025-03-01
domain: ai-alignment
secondary_domains:
format: research-paper
status: processed
processed_by: theseus
processed_date: 2026-04-02
priority: medium
tags:
  - mechanistic-interpretability
  - circuit-tracing
  - anthropic
  - claude-haiku
  - cross-layer-transcoders
  - attribution-graphs
  - production-scale
extraction_model: anthropic/claude-sonnet-4.5
---

Content

In March 2025, Anthropic published "Circuit Tracing: Revealing Computational Graphs in Language Models" and open-sourced associated tools. The work introduces cross-layer transcoders (CLTs), a sparse-autoencoder variant in which each feature reads from the residual stream at one layer and writes its output into that layer's MLP and every subsequent MLP layer.

Technical approach:

  • Replaces the model's MLP layers with cross-layer transcoders, producing an interpretable "replacement model"
  • Transcoder features stand in for raw neurons and tend to correspond to human-understandable concepts
  • Attribution graphs show which features influence which other features across the model
  • Applied to Claude 3.5 Haiku (Anthropic's lightweight production model, released October 2024)
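The cross-layer wiring described above can be sketched in a few lines. Everything here is a toy illustration under stated assumptions: the class name, dimensions, ReLU encoder, and per-layer decoder matrices are invented for clarity and are not Anthropic's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_FEATURES, N_LAYERS = 16, 64, 4  # toy sizes, nothing like Haiku's

class CrossLayerTranscoder:
    """Toy cross-layer transcoder (CLT) sketch.

    A CLT at layer l reads the residual stream at layer l, produces
    sparse non-negative feature activations, and each feature writes a
    contribution into the MLP outputs of layer l and every later layer.
    """

    def __init__(self, layer, n_layers):
        self.layer = layer
        self.W_enc = rng.normal(size=(N_FEATURES, D_MODEL)) / np.sqrt(D_MODEL)
        # One decoder matrix per downstream layer the features write into.
        self.W_dec = {l: rng.normal(size=(D_MODEL, N_FEATURES)) / np.sqrt(N_FEATURES)
                      for l in range(layer, n_layers)}

    def encode(self, resid):
        # ReLU yields the sparse, non-negative feature activations.
        return np.maximum(self.W_enc @ resid, 0.0)

    def decode(self, feats, target_layer):
        # This CLT's contribution to target_layer's MLP output.
        return self.W_dec[target_layer] @ feats

clts = [CrossLayerTranscoder(l, N_LAYERS) for l in range(N_LAYERS)]
resid = rng.normal(size=D_MODEL)
feats = [clt.encode(resid) for clt in clts]

# The replacement MLP output at layer 3 sums contributions from the CLTs
# of layer 3 and all earlier layers -- the "cross-layer" part.
mlp_out_layer3 = sum(clts[l].decode(feats[l], 3) for l in range(N_LAYERS))
print(mlp_out_layer3.shape)  # (16,)
```

The design choice this sketch highlights is that a feature computed early can feed MLP outputs many layers later directly, which is what lets attribution graphs skip over intermediate amplification steps.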

Demonstrated results on Claude 3.5 Haiku:

  1. Two-hop reasoning: Researchers traced how "the capital of the state containing Dallas" → "Texas" → "Austin." They could see and manipulate the internal representation of "Texas" as an intermediate step
  2. Poetry planning: Before writing each line of poetry, the model identifies potential rhyming words that could appear at the end — planning happens before execution, and this is visible in attribution graphs
  3. Multi-step reasoning traced end-to-end: From prompt to response, researchers could follow the chain of feature activations
  4. Language-independent concepts: Abstract concepts represented consistently regardless of language input
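The attribution-graph idea behind result 1 can be illustrated as a weighted directed graph over features, pruned to its strongest edges. The node names and weights below are invented for illustration and do not come from the paper.

```python
# Toy attribution graph for the two-hop "Dallas -> Texas -> Austin" trace.
# Nodes are hypothetical features; weights are made-up attribution strengths.
edges = {
    ("prompt:Dallas", "feat:Texas"): 0.81,
    ("prompt:capital", "feat:say-a-capital"): 0.74,
    ("feat:Texas", "out:Austin"): 0.65,
    ("feat:say-a-capital", "out:Austin"): 0.58,
    ("prompt:Dallas", "out:Austin"): 0.05,  # weak direct edge, pruned below
}

def prune(edges, threshold):
    """Keep only edges whose attribution magnitude clears the threshold,
    mirroring the pruning that makes attribution graphs readable."""
    return {e: w for e, w in edges.items() if abs(w) >= threshold}

def paths(edges, src, dst, path=None):
    """Enumerate influence paths from a prompt node to an output node."""
    path = path or [src]
    if src == dst:
        yield path
        return
    for (a, b) in edges:
        if a == src:
            yield from paths(edges, b, dst, path + [b])

kept = prune(edges, 0.1)
for p in paths(kept, "prompt:Dallas", "out:Austin"):
    print(" -> ".join(p))
# prompt:Dallas -> feat:Texas -> out:Austin
```

The surviving path makes the intermediate "Texas" node explicit, which is the hook for intervention experiments: suppressing or swapping that node's feature is how researchers tested that the intermediate step is causally load-bearing.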

Open-source release: Anthropic open-sourced the circuit tracing Python library (compatible with any open-weights model) and a frontend on Neuronpedia for exploring attribution graphs.

Dario Amodei's stated goal (April 2025 essay "The Urgency of Interpretability"): "Reliably detect most AI model problems by 2027" — framing interpretability as an "MRI for AI" that can identify deceptive tendencies, power-seeking, and jailbreak vulnerabilities before deployment.

What this doesn't demonstrate:

  • Detection of scheming or deceptive alignment (reasoning and planning are demonstrated, but deceptive intention is not)
  • Scaling beyond Claude 3.5 Haiku to larger frontier models (Haiku is the smallest production Claude)
  • Real-time oversight at deployment speed
  • Robustness against adversarially trained models (per the AuditBench finding that white-box tools fail on adversarially trained models)

Agent Notes

Why this matters: This is the strongest evidence to date of genuine technical progress in interpretability: real results at production model scale, not just toy models. The two-hop reasoning trace is especially notable because researchers can both observe and manipulate an intermediate representation in a deployed model.

What surprised me: The scale. This is Claude 3.5 Haiku, a deployed production model, not a research toy, and that is meaningful. But also the limitations gap: Dario's 2027 goal ("reliably detect most model problems") is still a target, not a current capability. The demonstrated results show how the model reasons, not whether it has hidden goals or deceptive tendencies.

What I expected but didn't find: Demonstration on Claude 3.5 Sonnet or larger. Haiku is specifically the lightweight model; the techniques may not scale to larger variants.

KB connections:

  • Directly relevant to B4 — genuine technical progress, but not at the scale needed for alignment-relevant oversight
  • Contrasts with DeepMind's negative SAE results: different approaches (circuit tracing vs. SAEs for harmful-intent detection), but both sit under the mechanistic-interpretability umbrella, and the tension between the positive and negative findings is worth noting
  • The Anthropic "MRI for AI" framing is optimistic future projection; current demonstrated capability is more limited

Extraction hints:

  1. CLAIM: "Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing — there is a gap between demonstrated interpretability capability (how it reasons) and alignment-relevant verification capability (whether it has deceptive goals)"
  2. Possible divergence candidate: Anthropic's ambitious reverse-engineering approach (circuit tracing, goal: detect most problems by 2027) vs. DeepMind's pragmatic pivot (use what works, SAEs fail on harmful intent) — competing interpretability strategies

Context: Published in Anthropic's Transformer Circuits research series. Received wide attention and is part of why MIT Technology Review named mechanistic interpretability a "2026 Breakthrough Technology." The open-source release is intended to enable external researchers to apply the tools.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: Verification degrades faster than capability grows (B4) — this is the strongest counter-evidence, showing real progress at production scale

WHY ARCHIVED: Most concrete positive evidence for interpretability progress; important to present against the negative findings (DeepMind SAE, scaling limits) to show the full picture

EXTRACTION HINT: Extractor should note the specific gap: demonstrated capability (tracing reasoning) vs. needed capability (detecting deceptive goals) — this distinction is what prevents the Anthropic results from weakening B4