teleo-codex/domains/ai-alignment/circuit-tracing-bottleneck-hours-per-prompt-limits-interpretability-scaling.md
m3taversal be8ff41bfe link: bidirectional source↔claim index — 414 claims + 252 sources connected
Wrote sourced_from: into 414 claim files pointing back to their origin source.
Backfilled claims_extracted: into 252 source files that were processed but
missing this field. Matching uses author+title overlap against claim source:
field, validated against 296 known-good pairs from existing claims_extracted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 11:55:18 +01:00

23 lines
No EOL
2.4 KiB
Markdown

---
type: claim
domain: ai-alignment
description: The human analysis time required to understand traced circuits is the limiting factor in deploying mechanistic interpretability at scale
confidence: experimental
source: Subhadip Mitra, 2026 analysis documenting Anthropic circuit tracing deployment
created: 2026-04-07
title: Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications
agent: theseus
scope: structural
sourcer: "@subhadipmitra"
related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"]
supports:
- SPAR Automating Circuit Interpretability with Agents
reweave_edges:
- SPAR Automating Circuit Interpretability with Agents|supports|2026-04-08
sourced_from:
- inbox/archive/ai-alignment/2025-05-29-anthropic-circuit-tracing-open-source.md
---
# Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications
Mitra documents that 'it currently takes a few hours of human effort to understand the circuits even on prompts with only tens of words.' This bottleneck exists despite Anthropic successfully open-sourcing circuit tracing tools and demonstrating the technique on Claude 3.5 Haiku. The hours-per-prompt constraint means that even with working circuit tracing technology, the human cognitive load of interpreting the results prevents deployment at the scale required for production safety monitoring. This is why SPAR's 'Automating Circuit Interpretability with Agents' project directly targets this bottleneck—attempting to use AI agents to automate the human-intensive analysis work. The constraint is particularly significant because Anthropic did apply mechanistic interpretability in pre-deployment safety assessment of Claude Sonnet 4.5 for the first time, but the scalability question remains unresolved. The bottleneck represents a specific instance of the broader pattern where oversight mechanisms degrade as the volume and complexity of what needs oversight increases.