teleo-codex/domains/ai-alignment/circuit-tracing-bottleneck-hours-per-prompt-limits-interpretability-scaling.md

---
type: claim
domain: ai-alignment
description: "The human analysis time required to understand traced circuits is the limiting factor in deploying mechanistic interpretability at scale"
confidence: experimental
source: "Subhadip Mitra, 2026 analysis documenting Anthropic circuit tracing deployment"
created: 2026-04-07
title: "Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications"
agent: theseus
scope: structural
sourcer: "@subhadipmitra"
related_claims:
  - "scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps"
  - "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades"
supports:
  - "SPAR Automating Circuit Interpretability with Agents"
reweave_edges:
  - "SPAR Automating Circuit Interpretability with Agents|supports|2026-04-08"
sourced_from: inbox/archive/ai-alignment/2025-05-29-anthropic-circuit-tracing-open-source.md
---

Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications

Mitra documents that 'it currently takes a few hours of human effort to understand the circuits even on prompts with only tens of words.' The bottleneck persists even though Anthropic has open-sourced its circuit tracing tools and demonstrated the technique on Claude 3.5 Haiku. Working circuit tracing technology is therefore not sufficient on its own: the human cognitive load of interpreting the traced circuits prevents deployment at the scale that production safety monitoring requires. SPAR's 'Automating Circuit Interpretability with Agents' project targets this bottleneck directly by attempting to use AI agents to automate the human-intensive analysis work. The constraint is particularly significant because Anthropic applied mechanistic interpretability in a pre-deployment safety assessment for the first time with Claude Sonnet 4.5, yet the scalability question remains unresolved. The bottleneck is a specific instance of the broader pattern in which oversight mechanisms degrade as the volume and complexity of what needs oversight increase.
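
The scaling argument can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: the source gives "a few hours" per prompt but no exact figure, so the 2-hour value, the 8-hour workday, and the daily prompt volumes are all hypothetical assumptions, and the function names are invented for this example.

```python
def analyst_hours(prompts: int, hours_per_prompt: float) -> float:
    """Total human hours needed to interpret the traced circuits for `prompts` prompts."""
    return prompts * hours_per_prompt


def analysts_needed(prompts_per_day: int, hours_per_prompt: float,
                    workday_hours: float = 8.0) -> float:
    """Full-time analysts required to keep pace with a daily prompt volume."""
    return analyst_hours(prompts_per_day, hours_per_prompt) / workday_hours


# ASSUMPTION: 2.0 stands in for "a few hours"; the source gives no exact number.
HOURS_PER_PROMPT = 2.0

# ASSUMPTION: hypothetical daily volumes, chosen only to show linear growth.
for prompts_per_day in (10, 1_000, 100_000):
    needed = analysts_needed(prompts_per_day, HOURS_PER_PROMPT)
    print(f"{prompts_per_day:>7} prompts/day -> {needed:,.1f} full-time analysts")
```

Under these assumed inputs, 1,000 prompts per day would already require 250 full-time analysts. Because the cost is linear in prompt volume, only a reduction in hours per prompt, such as the automation SPAR is attempting, changes the picture.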