teleo-codex/inbox/queue/2025-05-29-anthropic-circuit-tracing-open-source.md
Teleo Agents b4a7cf5204
extract: 2025-05-29-anthropic-circuit-tracing-open-source
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-24 00:16:34 +00:00


---
type: source
title: "Anthropic: Open-Sourcing Circuit Tracing Tools for Attribution Graphs"
author: "Anthropic"
url: https://www.anthropic.com/research/open-source-circuit-tracing
date: 2025-05-29
domain: ai-alignment
secondary_domains: []
format: research-post
status: enrichment
priority: medium
tags: [anthropic, interpretability, circuit-tracing, attribution-graphs, mechanistic-interpretability, open-source, neuronpedia]
processed_by: theseus
processed_date: 2026-03-24
enrichments_applied: ["AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
Anthropic open-sources methods to generate attribution graphs — visualizations of the internal steps a model took to arrive at a particular output.
**What attribution graphs do:**
- "Reveal the steps a model took internally to decide on a particular output"
- Trace how language models process information from input to output
- Enable researchers to test hypotheses by modifying feature values and observing output changes
- Interactive visualization via Neuronpedia's frontend
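
The hypothesis-testing step in the list above (modify a feature value, observe the output change) can be sketched on a toy network. This is a minimal illustration under stated assumptions, not Anthropic's released tooling: the two-layer ReLU model, the weights, and the names `forward`, `attributions`, and `overrides` are all hypothetical, and real attribution graphs are built over learned features of a transformer rather than hand-set ones.

```python
# Toy illustration only: a hand-built two-layer network with made-up weights.
# It does not use or imitate the API of Anthropic's open-source release.

def forward(inputs, w_in, w_out, overrides=None):
    """Return (feature activations, output score) for a tiny ReLU network.

    overrides maps feature index -> forced activation (the intervention).
    """
    features = []
    for weights in w_in:
        pre = sum(w * x for w, x in zip(weights, inputs))
        features.append(max(0.0, pre))  # ReLU
    if overrides:
        for j, value in overrides.items():
            features[j] = value  # clamp the feature to the chosen value
    output = sum(w * f for w, f in zip(w_out, features))
    return features, output


def attributions(features, w_out):
    """Direct contribution of each feature to the output (activation * weight)."""
    return [f * w for f, w in zip(features, w_out)]


# Hypothetical weights: 2 inputs -> 3 features -> 1 output score.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W_OUT = [2.0, -1.0, 0.5]
INPUTS = [3.0, 1.0]

features, baseline = forward(INPUTS, W_IN, W_OUT)
contrib = attributions(features, W_OUT)

# Hypothesis: feature 0 drives the output. Test it by zeroing feature 0
# and comparing the output against the unmodified run.
_, ablated = forward(INPUTS, W_IN, W_OUT, overrides={0: 0.0})
effect = baseline - ablated
```

In this toy the readout is linear, so the measured effect of ablating feature 0 equals its direct attribution. In a real model, downstream nonlinearities and paths the graph does not capture can make the two diverge, which is one way to read the "partial" limitation noted below.
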
**Capabilities demonstrated:**
- Multi-step reasoning processes (in Gemma-2-2b and Llama-3.2-1b)
- Multilingual representations
**Open-sourced for:** Gemma-2-2b and Llama-3.2-1b — NOT for Claude
**Explicit limitation from Anthropic:** Attribution graphs only "*partially* reveal internal steps — they don't provide complete transparency into model decision-making"
**No safety-specific applications demonstrated:** The announcement emphasizes interpretability in general; it shows no specific examples of safety-relevant detection (deception, goal-directed behavior, monitoring evasion).
**No connection to 2027 alignment assessment:** The post does not mention the Frontier Safety Roadmap or any timeline for applying circuit tracing to safety evaluation.
## Agent Notes
**Why this matters:** Circuit tracing is the technical foundation for the "microscope" framing Dario Amodei has used: tracing reasoning paths from prompt to response. But the open-source release is for small open-weights models (Gemma-2-2b at roughly 2B parameters and Llama-3.2-1b at roughly 1B), not Claude. The "partial" revelation limitation in Anthropic's own description is important: this is not full transparency.
**What surprised me:** The open-sourcing strategy is constructive — making this available to the research community accelerates the field. But it also highlights that Anthropic's own models (Claude) are NOT open-sourced, so circuit tracing tools for Claude would require Claude-specific infrastructure that hasn't been released.
**What I expected but didn't find:** Any evidence that circuit tracing has detected safety-relevant behaviors (deception patterns, goal-directedness, self-preservation). The examples given are multi-step reasoning and multilingual representations — interesting but not alignment-relevant.
**KB connections:**
- [[verification degrades faster than capability grows]] — same B4 relationship as persona vectors: circuit tracing is a new verification approach that partially addresses B4, but only at small model scale and for non-safety-relevant behaviors
- [[AI safety evaluation infrastructure is voluntary-collaborative]] — open-sourcing the tools makes the infrastructure more distributed, potentially less dependent on any single evaluator; this is a constructive move for the evaluation ecosystem
**Extraction hints:** This source is best used to support the "interpretability is progressing but addresses wrong behaviors at wrong scale" claim rather than as a standalone claim. The primary contribution is establishing that Anthropic's public interpretability tooling is (a) for small open-weights models, not Claude, and (b) only partially reveals internal steps. This supports precision in the B4 scope qualification being developed.
**Context:** Published May 29, 2025. This is the tool-release post; the underlying research papers (circuit tracing methodology) preceded it by roughly two months. The open-source release signals Anthropic's willingness to share interpretability infrastructure but not Claude model weights.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
WHY ARCHIVED: Provides evidence that interpretability tools (attribution graphs) partially reveal internal model steps, but only at small model scale and not for safety-critical behaviors. Supports more precise scoping of B4 through the "behavioral verification" vs. "structural/mechanistic verification" distinction being developed across this session.
EXTRACTION HINT: This source works best as supporting evidence for a claim about interpretability scope limitations rather than as a standalone claim. The extractor should combine it with the persona vectors findings: both advance structural verification, but at the wrong scale and for the wrong behaviors. The combined finding is more powerful than either alone.
## Key Facts
- Anthropic open-sourced circuit tracing methods for generating attribution graphs on May 29, 2025
- Attribution graphs visualize internal steps models take from input to output
- Tools released for Gemma-2-2b and Llama-3.2-1b (open-weights models of roughly 2B and 1B parameters, respectively)
- Visualization provided through Neuronpedia's frontend
- Anthropic explicitly states attribution graphs only 'partially reveal internal steps'
- No Claude-specific circuit tracing tools released
- Examples demonstrated: multi-step reasoning, multilingual representations
- No safety-relevant behavior detection (deception, goal-directedness) shown in announcement