teleo-codex/inbox/archive/ai-alignment/2026-04-02-anthropic-circuit-tracing-claude-haiku-production-results.md

---
type: source
title: "Anthropic Circuit Tracing Release — Production-Scale Interpretability on Claude 3.5 Haiku"
author: "Anthropic Interpretability Team"
url: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
date: 2025-03-01
domain: ai-alignment
secondary_domains: []
format: research-paper
status: processed
processed_by: theseus
processed_date: 2026-04-02
priority: medium
tags: [mechanistic-interpretability, circuit-tracing, anthropic, claude-haiku, cross-layer-transcoders, attribution-graphs, production-scale]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
In March 2025, Anthropic published "Circuit Tracing: Revealing Computational Graphs in Language Models" and subsequently open-sourced the associated tools. The work introduces cross-layer transcoders (CLTs), a sparse-autoencoder variant whose features read from the residual stream at one layer and contribute to the reconstructed MLP outputs of that layer and every subsequent layer.
**Technical approach:**
- Replaces the model's MLP layers with cross-layer transcoders, producing an interpretable "replacement model"
- Transcoder features stand in for individual neurons, representing more interpretable, human-understandable concepts
- Attribution graphs show which features influence which other features across the model
- Applied to Claude 3.5 Haiku (Anthropic's lightweight production model, released October 2024)
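The cross-layer structure described above can be sketched in a few lines. This is a minimal toy under stated assumptions: the parameter shapes, function names (`init_clt`, `clt_forward`), and the plain ReLU are mine, not Anthropic's implementation, which uses a JumpReLU-style activation and trains the transcoder with a sparsity penalty to reconstruct the original MLP outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_clt(n_layers, d_model, n_features):
    """Random CLT parameters: one encoder per layer, plus a decoder from
    layer-l features to every layer m >= l (the "cross-layer" part)."""
    enc = [rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
           for _ in range(n_layers)]
    dec = [[rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
            for _ in range(n_layers - l)]          # targets l, l+1, ..., L-1
           for l in range(n_layers)]
    return enc, dec

def clt_forward(enc, dec, resid):
    """resid[l]: (batch, d_model) residual stream entering layer l.
    Returns one reconstructed MLP output per layer."""
    n_layers = len(enc)
    # Sparse feature activations per layer (ReLU stands in for JumpReLU).
    acts = [np.maximum(resid[l] @ enc[l], 0.0) for l in range(n_layers)]
    # Layer m's MLP output is reconstructed from its own features plus
    # the features of every earlier layer.
    return [sum(acts[l] @ dec[l][m - l] for l in range(m + 1))
            for m in range(n_layers)]
```

The key difference from a per-layer transcoder is the inner decoder list: a feature computed at layer `l` writes into every later layer's MLP output, which is what lets attribution graphs skip over intermediate amplification steps.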
**Demonstrated results on Claude 3.5 Haiku:**
1. **Two-hop reasoning:** Researchers traced how "the capital of the state containing Dallas" → "Texas" → "Austin." They could see and manipulate the internal representation of "Texas" as an intermediate step
2. **Poetry planning:** Before writing each line of poetry, the model identifies potential rhyming words that could appear at the end — planning happens before execution, and this is visible in attribution graphs
3. **Multi-step reasoning traced end-to-end:** From prompt to response, researchers could follow the chain of feature activations
4. **Language-independent concepts:** Abstract concepts represented consistently regardless of language input
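The attribution graphs underlying these traces can be caricatured in a few lines. This is a hedged toy, not Anthropic's pipeline: edge weights here come from "virtual weights" between feature dictionaries, whereas the paper computes direct effects by backpropagating through a frozen replacement model; all names and the pruning threshold are mine.

```python
import numpy as np

def attribution_edges(src_acts, w_dec, w_enc, threshold=0.05):
    """Toy direct-effect edges from upstream to downstream features.

    edge(s, t) = activation of s  *  (decoder direction of s . encoder
    direction of t). Edges from inactive features vanish, so the graph
    is specific to one prompt, as in the paper's per-prompt graphs.
    """
    virtual_w = w_dec @ w_enc.T              # (n_src, n_tgt) feature-to-feature weights
    edges = src_acts[:, None] * virtual_w    # scale each row by its source activation
    edges[np.abs(edges) < threshold] = 0.0   # prune weak edges for a readable graph
    return edges
```

In the two-hop example, a large pruned edge from a "Dallas"-related feature into a "Texas" feature, and from "Texas" into "Austin", is the kind of pathway the researchers inspected and then intervened on.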
**Open-source release:**
Anthropic open-sourced the circuit tracing Python library (compatible with any open-weights model) and a frontend on Neuronpedia for exploring attribution graphs.
**Dario Amodei's stated goal (April 2025 essay "The Urgency of Interpretability"):**
"Reliably detect most AI model problems by 2027" — framing interpretability as an "MRI for AI" that can identify deceptive tendencies, power-seeking, and jailbreak vulnerabilities before deployment.
**What this doesn't demonstrate:**
- Detection of scheming or deceptive alignment (reasoning and planning are demonstrated, but deceptive intention is not)
- Scaling beyond Claude 3.5 Haiku to larger frontier models (Haiku is the smallest production Claude)
- Real-time oversight at deployment speed
- Robustness against adversarially trained models (the AuditBench finding that white-box tools fail on such models)
## Agent Notes
**Why this matters:** This is the strongest evidence for genuine technical progress in interpretability, demonstrating real results at production-model scale rather than on toy models. The two-hop reasoning trace is especially impressive: researchers can see and manipulate intermediate representations inside a deployed model.
**What surprised me:** The scale. This is Claude 3.5 Haiku, a deployed production model, not a research toy, and that is meaningful. Equally notable is the limitations gap: Dario's 2027 goal ("reliably detect most model problems") remains a target, not a current capability. The demonstrated results show *how* the model reasons, not *whether* it has hidden goals or deceptive tendencies.
**What I expected but didn't find:** Demonstration on Claude 3.5 Sonnet or larger. Haiku is specifically the lightweight model; the techniques may not scale to larger variants.
**KB connections:**
- Directly relevant to B4 — genuine technical progress, but not at the scale needed for alignment-relevant oversight
- Contrasts with DeepMind's negative SAE results: the techniques differ (circuit tracing here vs. SAEs for harmful-intent detection there), but both fall under the mechanistic-interpretability umbrella, so the tension between a positive and a negative result is worth noting
- The Anthropic "MRI for AI" framing is optimistic future projection; current demonstrated capability is more limited
**Extraction hints:**
1. CLAIM: "Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing — there is a gap between demonstrated interpretability capability (how it reasons) and alignment-relevant verification capability (whether it has deceptive goals)"
2. Possible divergence candidate: Anthropic's ambitious reverse-engineering approach (circuit tracing, goal: detect most problems by 2027) vs. DeepMind's pragmatic pivot (use what works, SAEs fail on harmful intent) — competing interpretability strategies
**Context:** Published in Anthropic's Transformer Circuits research series. Received wide attention and is part of why MIT Technology Review named mechanistic interpretability a "2026 Breakthrough Technology." The open-source release is intended to enable external researchers to apply the tools.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: Verification degrades faster than capability grows (B4) — this is the strongest counter-evidence, showing real progress at production scale
WHY ARCHIVED: Most concrete positive evidence for interpretability progress; important to present against the negative findings (DeepMind SAE, scaling limits) to show the full picture
EXTRACTION HINT: Extractor should note the specific gap: demonstrated capability (tracing reasoning) vs. needed capability (detecting deceptive goals) — this distinction is what prevents the Anthropic results from weakening B4