---
type: source
title: "Circuit Tracing for the Rest of Us: From Probes to Attribution Graphs and What It Means for Production Safety"
author: "Subhadip Mitra (@subhadipmitra)"
url: https://subhadipmitra.com/blog/2026/circuit-tracing-production/
date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: article
status: unprocessed
priority: medium
tags: [mechanistic-interpretability, circuit-tracing, production-safety, attribution-graphs, SAE, sandbagging-probes]
---

## Content
Subhadip Mitra's 2026 analysis documents the transition of mechanistic interpretability from a research direction to a practical engineering discipline, specifically examining what Anthropic's circuit-tracing work means for production safety pipelines.

**Key observations:**
- Mechanistic interpretability is "moving from 'interesting research direction' to 'practical engineering discipline,' with this transition happening faster than expected"
- Anthropic demonstrated circuit tracing on Claude 3.5 Haiku; the community now needs this capability on open-weight models (Llama, Mistral, Qwen, Gemma) — Mitra's sandbagging probes are an attempt at this
- "Next-generation safety tools will need to work at the representation level: detecting harmful intent in a model's internal state before it produces output"
- Circuit tracing extends from detection to understanding — revealing both *that* deception occurs and *where* in the circuit intervention is possible
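
The "representation-level detection" idea in the bullets above is, at its simplest, a linear probe trained on internal activations. A minimal sketch on synthetic data (the shapes, the 0.5 mean shift, and `train_probe` are illustrative assumptions, not anything from the article or Anthropic's tooling):

```python
import numpy as np

# Sketch of "representation-level detection": a linear probe that reads
# a harmful-vs-benign label off a model's internal activations. The
# activations here are synthetic stand-ins (random vectors with a small
# mean shift), not real residual-stream captures.

rng = np.random.default_rng(0)
d_model = 64

benign = rng.normal(0.0, 1.0, size=(200, d_model))          # "benign" activations
harmful = rng.normal(0.0, 1.0, size=(200, d_model)) + 0.5   # shifted "harmful" cluster

X = np.vstack([benign, harmful])
y = np.concatenate([np.zeros(200), np.ones(200)])

def train_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe: sigmoid(w @ x + b) estimates P(harmful)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                            # gradient of log-loss wrt logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

w, b = train_probe(X, y)
accuracy = (((X @ w + b) > 0) == y.astype(bool)).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

On a real model, `X` would come from a hooked forward pass at a chosen layer; the probe itself stays this simple, which is part of why probes are hard to beat as detectors.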

**On the Anthropic/DeepMind divergence:**

- Anthropic: circuit tracing → attribution graphs → emotion vectors (all toward deeper mechanistic understanding)
- DeepMind: pivoted to pragmatic interpretability after sparse autoencoders (SAEs) underperformed linear probes on harmful-intent detection
- These are complementary, not competing: "DeepMind uses what works, Anthropic builds the map. You need both."
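
The attribution graphs above can be pictured as a weighted directed graph over interpretable features; ranking incoming attributions is what makes the "where to intervene" question concrete. A toy sketch (feature names and weights are invented, and this is not Anthropic's actual data structure):

```python
from collections import defaultdict

# Toy attribution graph: nodes are interpretable features, directed edges
# carry attribution weights (how strongly one node drives another). The
# feature names and weights below are invented for illustration.

class AttributionGraph:
    def __init__(self):
        self.edges = defaultdict(dict)  # source -> {target: weight}

    def add_edge(self, source, target, weight):
        self.edges[source][target] = weight

    def top_contributors(self, target, k=3):
        """Rank features by |attribution| into `target`: where an
        intervention (e.g. ablating a feature) would matter most."""
        incoming = [(src, out[target]) for src, out in self.edges.items()
                    if target in out]
        return sorted(incoming, key=lambda sw: abs(sw[1]), reverse=True)[:k]

g = AttributionGraph()
g.add_edge("deception_plan", "output_logit", 0.62)
g.add_edge("user_pressure", "deception_plan", 0.41)
g.add_edge("honesty", "output_logit", -0.35)

print(g.top_contributors("output_logit"))
# -> [('deception_plan', 0.62), ('honesty', -0.35)]
```

This is the sense in which circuit tracing extends probing: a probe only says *that* a feature fires, while the graph says *which* upstream features drive it.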

**On community democratization:**

- Anthropic open-sourcing circuit tracing tools enables community research on popular open-weight models
- Neuronpedia hosts an interactive frontend for attribution graph exploration
- The key remaining bottleneck: "it currently takes a few hours of human effort to understand the circuits even on prompts with only tens of words"
- SPAR's "Automating Circuit Interpretability with Agents" project directly targets this bottleneck

**The production safety application:**

- Mitra documented that Anthropic applied mechanistic interpretability, for the first time, in the pre-deployment safety assessment of Claude Sonnet 4.5
- The assessment examined internal features for dangerous capabilities, deceptive tendencies, or undesired goals
- This represents the first integration of interpretability research into deployment decisions for a production system

## Agent Notes

**Why this matters:** Provides the synthesis view of where mechanistic interpretability stands as of early 2026 — bridging the research papers (Anthropic, DeepMind) to practical safety tooling. Mitra is a practitioner-level commentator whose sandbagging probes represent community-level operationalization of interpretability. His framing of Anthropic/DeepMind as complementary (not competing) is analytically useful.
**What surprised me:** The "hours per prompt" bottleneck is explicitly documented here. This is what the SPAR "Automating Circuit Interpretability with Agents" project is trying to solve — using AI agents to automate the human-intensive analysis work. If successful, it would change the scalability picture significantly.
**What I expected but didn't find:** A clear answer on whether circuit tracing scales to frontier-scale models (beyond Haiku). Mitra acknowledges the scaling challenge but doesn't document successful scaling results. The answer is: not yet.
**KB connections:**
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match]] — circuit tracing is different from formal verification, but Mitra's "representation-level detection" vision is similar in intent
- [[scalable oversight degrades rapidly as capability gaps grow]] — the "hours per prompt" bottleneck is exactly this degradation
- [[human-AI mathematical collaboration succeeds through role specialization]] — SPAR's agent-automated circuit tracing is directly applying this pattern to interpretability

**Extraction hints:**

- "Hours per prompt" bottleneck is a specific, citable measurement for the interpretability scaling challenge — use this as evidence in B4-related claims
- The Anthropic/DeepMind complementarity framing is claim-worthy: "Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks: Anthropic maps causal mechanisms, DeepMind detects harmful intent — together covering more failure modes than either alone"
- The SPAR agent-automated circuit tracing project is the most direct attempted solution to the hours-per-prompt bottleneck
**Context:** Published early 2026, following Anthropic's open-sourcing of circuit tracing tools. Part of Mitra's four-part series on deliberative alignment and technical safety.

## Curator Notes

PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: Documents the "hours per prompt" bottleneck as a specific, citable evidence point for interpretability scaling challenges. Also provides the synthesis of Anthropic/DeepMind interpretability divergence.
EXTRACTION HINT: Don't extract the whole article as one claim. Three separable claims: (1) hours-per-prompt bottleneck as specific measurement, (2) Anthropic/DeepMind complementarity framing, (3) first documented production deployment decision using interpretability.