---
type: source
title: "Anthropic Circuit Tracing Release — Production-Scale Interpretability on Claude 3.5 Haiku"
author: "Anthropic Interpretability Team"
url: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
date: 2025-03-01
domain: ai-alignment
secondary_domains: []
format: research-paper
status: processed
processed_by: theseus
processed_date: 2026-04-02
priority: medium
tags: [mechanistic-interpretability, circuit-tracing, anthropic, claude-haiku, cross-layer-transcoders, attribution-graphs, production-scale]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

In March 2025, Anthropic published "Circuit Tracing: Revealing Computational Graphs in Language Models" and open-sourced the accompanying tools. The work introduces cross-layer transcoders (CLTs) — a variant of sparse autoencoder that reads from a single layer's residual stream but writes its output into all subsequent MLP layers.

**Technical approach:**

- Replaces the model's MLP layers with cross-layer transcoders
- Transcoders re-express neuron activations as more interpretable "features" — human-understandable concepts
- Attribution graphs show which features influence which other features across the model
- Applied to Claude 3.5 Haiku (Anthropic's lightweight production model, released October 2024)
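The read-from-one-layer, write-to-all-later-layers wiring is the defining difference between a CLT and an ordinary per-layer transcoder. A minimal sketch of that wiring, with illustrative names and dimensions (this is not Anthropic's implementation, and training/sparsity details are omitted):

```python
import numpy as np

class CrossLayerTranscoder:
    """Toy cross-layer transcoder: one encoder reads the residual stream
    at `read_layer`; a separate decoder writes a contribution into the
    MLP output of every subsequent layer."""

    def __init__(self, d_model, n_features, n_layers, read_layer, seed=0):
        rng = np.random.default_rng(seed)
        self.read_layer = read_layer
        # Single encoder at the read layer...
        self.W_enc = rng.normal(size=(d_model, n_features)) * 0.02
        # ...and one decoder per later layer.
        self.W_dec = {
            layer: rng.normal(size=(n_features, d_model)) * 0.02
            for layer in range(read_layer + 1, n_layers)
        }

    def encode(self, resid):
        # Non-negative feature activations (ReLU stands in for whatever
        # sparsity mechanism the real CLT uses).
        return np.maximum(resid @ self.W_enc, 0.0)

    def decode(self, features, layer):
        # Contribution this CLT makes to the MLP output at `layer`.
        return features @ self.W_dec[layer]
```

Usage: a CLT built with `read_layer=1` in a 4-layer toy model encodes once from layer 1's residual stream and can then decode into layers 2 and 3, which is what lets one feature dictionary explain computation spread across several layers.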

**Demonstrated results on Claude 3.5 Haiku:**

1. **Two-hop reasoning:** Researchers traced how "the capital of the state containing Dallas" resolves via "Texas" to "Austin," and could both observe and manipulate the internal representation of "Texas" as an intermediate step
2. **Poetry planning:** Before writing each line of a poem, the model identifies candidate rhyming words for the line's end — planning happens before execution, and it is visible in the attribution graphs
3. **Multi-step reasoning traced end-to-end:** From prompt to response, researchers could follow the chain of feature activations
4. **Language-independent concepts:** Abstract concepts are represented consistently regardless of the input language
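The two-hop result is both observational (the "Texas" step is visible) and causal (clamping it changes the answer). The logic of that intervention can be shown with a toy stand-in for the two internal hops — everything below is a lookup-table caricature, not a real model:

```python
def capital_of_state_containing(city, clamp_state=None):
    """Toy two-hop computation with an intervention point on the
    intermediate 'state' representation."""
    city_to_state = {"Dallas": "Texas", "Oakland": "California"}
    state_to_capital = {"Texas": "Austin", "California": "Sacramento"}

    state = city_to_state[city]      # hop 1: city -> state feature
    if clamp_state is not None:
        state = clamp_state          # intervention: overwrite mid-computation
    return state_to_capital[state]   # hop 2: state feature -> capital
```

Run normally, `capital_of_state_containing("Dallas")` returns "Austin"; clamping the intermediate to "California" yields "Sacramento", which mirrors the paper's feature-substitution experiment.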

**Open-source release:**

Anthropic open-sourced the circuit-tracing Python library (compatible with any open-weights model) and a Neuronpedia frontend for exploring attribution graphs.
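Conceptually, the artifact such tooling produces is a weighted graph: nodes are features (plus input tokens and output logits), edges carry direct-effect weights, and a reasoning trace is a high-weight path from prompt to response. The released library's actual data model and API differ; the class and node names below are purely illustrative:

```python
from collections import defaultdict

class AttributionGraph:
    """Toy attribution graph: directed edges weighted by direct effect."""

    def __init__(self):
        self.edges = defaultdict(dict)  # src -> {dst: weight}

    def add_edge(self, src, dst, weight):
        self.edges[src][dst] = weight

    def strongest_path(self, src, dst):
        """Greedily follow the highest-weight outgoing edge (adequate for
        an acyclic toy graph; real tooling prunes and scores properly)."""
        path, node = [src], src
        while node != dst and self.edges[node]:
            node = max(self.edges[node], key=self.edges[node].get)
            path.append(node)
        return path if node == dst else None
```

For the Dallas example, adding edges `token:Dallas → feat:Texas → feat:say-Austin → logit:Austin` lets `strongest_path` recover the same intermediate-step trace the paper visualizes.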

**Dario Amodei's stated goal (April 2025 essay "The Urgency of Interpretability"):**

"Reliably detect most AI model problems by 2027" — framing interpretability as an "MRI for AI" that can identify deceptive tendencies, power-seeking, and jailbreak vulnerabilities before deployment.

**What this doesn't demonstrate:**

- Detection of scheming or deceptive alignment (reasoning and planning are demonstrated; deceptive intent is not)
- Scaling beyond Claude 3.5 Haiku to larger frontier models (Haiku is the smallest production Claude)
- Real-time oversight at deployment speed
- Robustness against adversarially trained models (the AuditBench finding shows white-box tools failing on adversarially trained models)

## Agent Notes

**Why this matters:** This is the strongest evidence of genuine technical progress in interpretability — real results at production-model scale, not just on toy models. The two-hop reasoning trace is particularly striking: researchers can see and manipulate intermediate representations in a deployed production model.

**What surprised me:** The scale: this is Claude 3.5 Haiku, a deployed production model, not a research toy. That's meaningful. But also the limitations gap: Dario's 2027 goal ("reliably detect most model problems") is still a target, not a current capability. The demonstrated results show *how* the model reasons, not *whether* it has hidden goals or deceptive tendencies.

**What I expected but didn't find:** A demonstration on Claude 3.5 Sonnet or larger. Haiku is specifically the lightweight model; the techniques may not scale to larger variants.

**KB connections:**

- Directly relevant to B4 — genuine technical progress, but not at the scale needed for alignment-relevant oversight
- Contrasts with DeepMind's negative SAE results: Anthropic's findings are positive, DeepMind's negative. The approaches differ (circuit tracing vs. SAEs for harmful-intent detection), but both sit under the "mechanistic interpretability" umbrella, and the tension is worth noting.
- Anthropic's "MRI for AI" framing is an optimistic future projection; the demonstrated capability today is more limited

**Extraction hints:**

1. CLAIM: "Mechanistic interpretability at production-model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuit — there is a gap between demonstrated interpretability capability (how the model reasons) and alignment-relevant verification capability (whether it has deceptive goals)"
2. Possible divergence candidate: Anthropic's ambitious reverse-engineering approach (circuit tracing; goal: detect most problems by 2027) vs. DeepMind's pragmatic pivot (use what works; SAEs fail on harmful-intent detection) — competing interpretability strategies

**Context:** Published in Anthropic's Transformer Circuits research series. It received wide attention and is part of why MIT Technology Review named mechanistic interpretability a "2026 Breakthrough Technology." The open-source release is intended to let external researchers apply the tools.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: Verification degrades faster than capability grows (B4) — this is the strongest counter-evidence, showing real progress at production scale

WHY ARCHIVED: Most concrete positive evidence of interpretability progress; important to set against the negative findings (DeepMind SAE, scaling limits) to show the full picture

EXTRACTION HINT: The extractor should note the specific gap: demonstrated capability (tracing reasoning) vs. needed capability (detecting deceptive goals) — this distinction is what keeps the Anthropic results from weakening B4