teleo-codex/inbox/archive/2026-01-00-mechanistic-interpretability-2026-status-report.md

---
type: source
title: "Mechanistic Interpretability: 2026 Status Report"
author: bigsnarfdude (compilation from multiple sources)
url: https://gist.github.com/bigsnarfdude/629f19f635981999c51a8bd44c6e2a54
date: 2026-01-01
domain: ai-alignment
secondary_domains:
format: report
status: unprocessed
priority: high
tags:
  - mechanistic-interpretability
  - SAE
  - safety
  - technical-alignment
  - limitations
  - DeepMind-pivot
---

Content

Comprehensive status report on mechanistic interpretability as of early 2026:

Recognition: MIT Technology Review named it a "2026 breakthrough technology." A January 2025 consensus paper by 29 researchers across 18 organizations established the field's core open problems.

Major breakthroughs:

  • Google DeepMind's Gemma Scope 2 (Dec 2025): largest open-source interpretability infrastructure, 270M to 27B parameter models
  • Sparse autoencoders (SAEs) scaled to GPT-4 with 16 million latent variables (see the sketch after this list)
  • Attribution graphs (Anthropic, March 2025): trace computational paths for ~25% of prompts
  • Anthropic used mechanistic interpretability in pre-deployment safety assessment of Claude Sonnet 4.5 — first integration into production deployment decisions
  • Stream algorithm (Oct 2025): near-linear time attention analysis, eliminating 97-99% of token interactions
  • OpenAI identified "misaligned persona" features detectable via SAEs
  • Misalignment induced by fine-tuning could be reversed with ~100 corrective training samples
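
For concreteness, a minimal sparse autoencoder sketch of the kind these results build on: a wide ReLU encoder/decoder trained to reconstruct activations under a sparsity penalty. This is an illustrative PyTorch sketch with toy dimensions, not the actual GPT-4 or Gemma Scope implementation; the "16 million latent variables" figure refers to the width of the latent layer at frontier scale.

```python
# Minimal sparse autoencoder (SAE) sketch; dimensions are toy values,
# not the 16M-latent GPT-4-scale configuration cited above.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_latent: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, activations: torch.Tensor):
        # Encode residual-stream activations into a wide, mostly-zero latent vector,
        # then reconstruct; each active latent is treated as a candidate "feature".
        latents = torch.relu(self.encoder(activations))
        return self.decoder(latents), latents

sae = SparseAutoencoder()
acts = torch.randn(32, 768)   # stand-in for activations captured at a model hook
recon, latents = sae(acts)

# Training objective: reconstruction error plus an L1 penalty that enforces sparsity.
loss = ((recon - acts) ** 2).mean() + 1e-3 * latents.abs().mean()
```

Swapping the model's own activations for `recon` at inference time is what the reconstruction-degradation numbers below measure.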

Critical limitations:

  • SAE reconstructions cause 10-40% performance degradation on downstream tasks
  • Google DeepMind found SAEs UNDERPERFORMED simple linear probes on practical safety tasks → strategic pivot away from fundamental SAE research
  • No rigorous definition of "feature" exists
  • Deep networks exhibit "chaotic dynamics" where steering vectors become unpredictable after O(log(1/ε)) layers (see the steering sketch after this list)
  • Many circuit-finding queries proven NP-hard and inapproximable
  • Interpreting Gemma 2 required 20 petabytes of storage and GPT-3-level compute
  • Circuit discovery for 25% of prompts required hours of human effort per analysis
  • Feature manifolds: SAEs may learn far fewer distinct features than latent counts suggest
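
Steering vectors, referenced in the chaotic-dynamics item above, are simple additive interventions on intermediate activations. A minimal sketch, assuming a generic transformer residual stream; the hook point, direction, and scale here are illustrative assumptions, not values from the cited work.

```python
# Activation steering sketch: add a scaled feature direction to the residual
# stream at one layer. Direction and scale are illustrative assumptions.
import torch

def apply_steering(hidden: torch.Tensor, direction: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    # hidden: (batch, seq_len, d_model) activations captured at some layer L.
    return hidden + alpha * direction / direction.norm()

hidden = torch.randn(1, 16, 768)   # stand-in for layer-L activations
direction = torch.randn(768)       # e.g. a probe weight or SAE decoder column
steered = apply_steering(hidden, direction)

# The chaotic-dynamics result says the behavioural effect of this edit becomes
# hard to predict once it has propagated through O(log(1/eps)) further layers.
```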

Strategic divergence:

  • Anthropic targets "reliably detecting most model problems by 2027" — comprehensive MRI approach
  • Google DeepMind pivoted to "pragmatic interpretability" — task-specific utility over fundamental understanding
  • Neel Nanda: "the most ambitious vision...is probably dead" but medium-risk approaches viable

The practical utility gap: simple baseline methods outperform sophisticated interpretability approaches on safety-relevant detection tasks. This is the field's central unresolved tension.
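
To make the gap concrete, the baseline in these comparisons is typically nothing more exotic than a logistic-regression probe on raw activations. A hedged sketch of that comparison with synthetic data; the activations, labels, and feature direction below are made up for illustration, not drawn from any of the cited studies.

```python
# Illustration of the "practical utility gap": a plain linear probe on raw
# activations vs. scoring with a single pre-identified feature direction.
# All data here is synthetic; real comparisons use captured model activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 768))                  # stand-in activations
labels = (acts[:, :3].sum(axis=1) > 0).astype(int)   # toy "unsafe behaviour" label

# Baseline: logistic-regression probe trained directly on activations.
probe = LogisticRegression(max_iter=1000).fit(acts[:800], labels[:800])
print("linear probe accuracy:", probe.score(acts[800:], labels[800:]))

# Interpretability-style alternative: threshold the activation of one
# "feature" direction (in practice an SAE decoder column chosen by hand).
feature = rng.normal(size=768)
scores = acts[800:] @ feature
preds = (scores > np.median(scores)).astype(int)
print("single-feature accuracy:", (preds == labels[800:]).mean())
```

The reported finding is that, on safety-relevant detection tasks, the probe-style baseline tends to win this comparison.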

Agent Notes

Why this matters: This directly tests my belief that technical alignment approaches are structurally insufficient. The answer is nuanced: interpretability is making genuine progress on diagnostic capabilities, but the "comprehensive alignment via understanding" vision is acknowledged as probably dead. This supports my framing while forcing me to grant more ground to technical approaches than I have so far.

What surprised me: Google DeepMind's pivot AWAY from SAEs. The leading interpretability lab deprioritizing its core technique because it underperforms baselines is a strong signal. Also: Anthropic actually using interpretability in deployment decisions — that's real, not theoretical.

What I expected but didn't find: No evidence that interpretability can handle the preference diversity problem or the coordination problem. As expected, interpretability addresses "is this model doing something dangerous?" not "is this model serving diverse values?" or "are competing models producing safe interaction effects?"

KB connections:

Extraction hints: Key claims: (1) interpretability as diagnostic vs. comprehensive alignment, (2) the practical utility gap (baselines > sophisticated methods), (3) the compute cost of interpretability as an alignment-tax amplifier, (4) DeepMind's strategic pivot as a market signal.

Context: This is a compilation, not a primary source. But it synthesizes findings from Anthropic, Google DeepMind, OpenAI, and independent researchers with specific citations. The individual claims can be verified against primary sources.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow, with debate achieving only 50 percent success at moderate gaps

WHY ARCHIVED: Provides 2026 status evidence on whether technical alignment (interpretability) can close the alignment gap; the answer is "useful but bounded"

EXTRACTION HINT: Focus on the practical utility gap (baselines outperform SAEs on safety tasks), the DeepMind strategic pivot, and Anthropic's production deployment use. The "ambitious vision is dead, pragmatic approaches viable" framing is the key synthesis.