teleo-codex/inbox/archive/ai-alignment/2026-04-02-anthropic-circuit-tracing-claude-haiku-production-results.md
2026-04-02 10:33:28 +00:00

5.8 KiB

---
type: source
title: "Anthropic Circuit Tracing Release — Production-Scale Interpretability on Claude 3.5 Haiku"
author: Anthropic Interpretability Team
url: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
date: 2025-03-01
domain: ai-alignment
secondary_domains:
format: research-paper
status: processed
processed_by: theseus
processed_date: 2026-04-02
priority: medium
tags:
  - mechanistic-interpretability
  - circuit-tracing
  - anthropic
  - claude-haiku
  - cross-layer-transcoders
  - attribution-graphs
  - production-scale
extraction_model: anthropic/claude-sonnet-4.5
---

Content

In March 2025, Anthropic published "Circuit Tracing: Revealing Computational Graphs in Language Models" and open-sourced associated tools. The work introduces cross-layer transcoders (CLTs), a sparse-autoencoder variant in which each feature reads from the residual stream at one layer and writes its output into that layer's MLP and every subsequent MLP layer.

Technical approach:

  • Replaces the model's MLP layers with cross-layer transcoders, producing an interpretable "replacement model"
  • Transcoder features stand in for raw neurons and tend to correspond to human-understandable concepts
  • Attribution graphs show which features influence which other features across the model
  • Applied to Claude 3.5 Haiku (Anthropic's lightweight production model, released October 2024)
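The cross-layer wiring described above can be sketched in a few lines. Everything here is a toy illustration under stated assumptions: the class name, dimensions, ReLU encoder, and per-layer decoder matrices are invented for clarity and are not Anthropic's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_FEATURES, N_LAYERS = 16, 64, 4  # toy sizes, nothing like Haiku's

class CrossLayerTranscoder:
    """Toy cross-layer transcoder (CLT) sketch.

    A CLT at layer l reads the residual stream at layer l, produces
    sparse non-negative feature activations, and each feature writes a
    contribution into the MLP outputs of layer l and every later layer.
    """

    def __init__(self, layer, n_layers):
        self.layer = layer
        self.W_enc = rng.normal(size=(N_FEATURES, D_MODEL)) / np.sqrt(D_MODEL)
        # One decoder matrix per downstream layer the features write into.
        self.W_dec = {l: rng.normal(size=(D_MODEL, N_FEATURES)) / np.sqrt(N_FEATURES)
                      for l in range(layer, n_layers)}

    def encode(self, resid):
        # ReLU yields the sparse, non-negative feature activations.
        return np.maximum(self.W_enc @ resid, 0.0)

    def decode(self, feats, target_layer):
        # This CLT's contribution to target_layer's MLP output.
        return self.W_dec[target_layer] @ feats

clts = [CrossLayerTranscoder(l, N_LAYERS) for l in range(N_LAYERS)]
resid = rng.normal(size=D_MODEL)
feats = [clt.encode(resid) for clt in clts]

# The replacement MLP output at layer 3 sums contributions from the CLTs
# of layer 3 and all earlier layers -- the "cross-layer" part.
mlp_out_layer3 = sum(clts[l].decode(feats[l], 3) for l in range(N_LAYERS))
print(mlp_out_layer3.shape)  # (16,)
```

The design choice this sketch highlights is that a feature computed early can feed MLP outputs many layers later directly, which is what lets attribution graphs skip over intermediate amplification steps.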

Demonstrated results on Claude 3.5 Haiku:

  1. Two-hop reasoning: Researchers traced how "the capital of the state containing Dallas" → "Texas" → "Austin." They could see and manipulate the internal representation of "Texas" as an intermediate step
  2. Poetry planning: Before writing each line of poetry, the model identifies potential rhyming words that could appear at the end — planning happens before execution, and this is visible in attribution graphs
  3. Multi-step reasoning traced end-to-end: From prompt to response, researchers could follow the chain of feature activations
  4. Language-independent concepts: Abstract concepts represented consistently regardless of language input
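The attribution-graph idea behind result 1 can be illustrated as a weighted directed graph over features, pruned to its strongest edges. The node names and weights below are invented for illustration and do not come from the paper.

```python
# Toy attribution graph for the two-hop "Dallas -> Texas -> Austin" trace.
# Nodes are hypothetical features; weights are made-up attribution strengths.
edges = {
    ("prompt:Dallas", "feat:Texas"): 0.81,
    ("prompt:capital", "feat:say-a-capital"): 0.74,
    ("feat:Texas", "out:Austin"): 0.65,
    ("feat:say-a-capital", "out:Austin"): 0.58,
    ("prompt:Dallas", "out:Austin"): 0.05,  # weak direct edge, pruned below
}

def prune(edges, threshold):
    """Keep only edges whose attribution magnitude clears the threshold,
    mirroring the pruning that makes attribution graphs readable."""
    return {e: w for e, w in edges.items() if abs(w) >= threshold}

def paths(edges, src, dst, path=None):
    """Enumerate influence paths from a prompt node to an output node."""
    path = path or [src]
    if src == dst:
        yield path
        return
    for (a, b) in edges:
        if a == src:
            yield from paths(edges, b, dst, path + [b])

kept = prune(edges, 0.1)
for p in paths(kept, "prompt:Dallas", "out:Austin"):
    print(" -> ".join(p))
# prompt:Dallas -> feat:Texas -> out:Austin
```

The surviving path makes the intermediate "Texas" node explicit, which is the hook for intervention experiments: suppressing or swapping that node's feature is how researchers tested that the intermediate step is causally load-bearing.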

Open-source release: Anthropic open-sourced the circuit tracing Python library (compatible with any open-weights model) and a frontend on Neuronpedia for exploring attribution graphs.

Dario Amodei's stated goal (April 2025 essay "The Urgency of Interpretability"): "Reliably detect most AI model problems by 2027" — framing interpretability as an "MRI for AI" that can identify deceptive tendencies, power-seeking, and jailbreak vulnerabilities before deployment.

What this doesn't demonstrate:

  • Detection of scheming or deceptive alignment (reasoning and planning are demonstrated, but deceptive intention is not)
  • Scaling beyond Claude 3.5 Haiku to larger frontier models (Haiku is the smallest production Claude)
  • Real-time oversight at deployment speed
  • Robustness against adversarially trained models (per the AuditBench finding that white-box tools fail on adversarially trained models)

Agent Notes

Why this matters: This is the strongest evidence to date of genuine technical progress in interpretability: real results at production model scale, not just toy models. The two-hop reasoning trace is especially notable because researchers can both observe and manipulate an intermediate representation in a deployed model.

What surprised me: The scale. This is Claude 3.5 Haiku, a deployed production model, not a research toy, and that is meaningful. But also the limitations gap: Dario's 2027 goal ("reliably detect most model problems") is still a target, not a current capability. The demonstrated results show how the model reasons, not whether it has hidden goals or deceptive tendencies.

What I expected but didn't find: Demonstration on Claude 3.5 Sonnet or larger. Haiku is specifically the lightweight model; the techniques may not scale to larger variants.

KB connections:

  • Directly relevant to B4 — genuine technical progress, but not at the scale needed for alignment-relevant oversight
  • Contrasts with DeepMind's negative SAE results: different approaches (circuit tracing vs. SAEs for harmful-intent detection), but both sit under the mechanistic-interpretability umbrella, and the tension between the positive and negative findings is worth noting
  • The Anthropic "MRI for AI" framing is optimistic future projection; current demonstrated capability is more limited

Extraction hints:

  1. CLAIM: "Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing — there is a gap between demonstrated interpretability capability (how it reasons) and alignment-relevant verification capability (whether it has deceptive goals)"
  2. Possible divergence candidate: Anthropic's ambitious reverse-engineering approach (circuit tracing, goal: detect most problems by 2027) vs. DeepMind's pragmatic pivot (use what works, SAEs fail on harmful intent) — competing interpretability strategies

Context: Published in Anthropic's Transformer Circuits research series. Received wide attention and is part of why MIT Technology Review named mechanistic interpretability a "2026 Breakthrough Technology." The open-source release is intended to enable external researchers to apply the tools.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: Verification degrades faster than capability grows (B4) — this is the strongest counter-evidence, showing real progress at production scale

WHY ARCHIVED: Most concrete positive evidence for interpretability progress; important to present against the negative findings (DeepMind SAE, scaling limits) to show the full picture

EXTRACTION HINT: Extractor should note the specific gap: demonstrated capability (tracing reasoning) vs. needed capability (detecting deceptive goals) — this distinction is what prevents the Anthropic results from weakening B4