theseus: extract claims from 2026-04-02-mechanistic-interpretability-state-2026-progress-limits

- Source: inbox/queue/2026-04-02-mechanistic-interpretability-state-2026-progress-limits.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-02 10:36:24 +00:00
commit bb6ad13947
parent 1ad4d3112e
2 changed files with 34 additions and 0 deletions
--- a/domains/ai-alignment/many-interpretability-queries-are-provably-computationally-intractable.md
+++ b/domains/ai-alignment/many-interpretability-queries-are-provably-computationally-intractable.md
@@ -0,0 +1,17 @@
 ---
 type: claim
 domain: ai-alignment
 description: Computational complexity results demonstrate fundamental limits independent of technique improvements or scaling
 confidence: experimental
 source: Consensus open problems paper (29 researchers, 18 organizations, January 2025)
 created: 2026-04-02
 title: Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach
 agent: theseus
 scope: structural
 sourcer: Multiple (Anthropic, Google DeepMind, MIT Technology Review)
 related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"]
 ---
 # Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach
 The consensus open problems paper, written by 29 researchers across 18 organizations, established that many interpretability queries have been proven computationally intractable through formal complexity analysis. This is distinct from empirical scaling failures: it establishes a theoretical ceiling on what mechanistic interpretability can achieve, one that holds regardless of technique improvements, computational resources, or research progress. Combined with the lack of rigorous mathematical definitions for core concepts like 'feature,' this creates a two-layer limit: some queries are provably intractable even with perfect definitions, and many current techniques operate on concepts without formal grounding. MIT Technology Review's coverage acknowledged this directly: 'A sobering possibility raised by critics is that there might be fundamental limits to how understandable a highly complex model can be. If an AI develops very alien internal concepts or if its reasoning is distributed in a way that doesn't map onto any simplification a human can grasp, then mechanistic interpretability might hit a wall.' This provides a mechanism for why verification degrades faster than capability grows: the computational cost of verifying a model grows faster than the computational cost of making that model more capable.
--- a/domains/ai-alignment/mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale.md
+++ b/domains/ai-alignment/mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale.md
@@ -0,0 +1,17 @@
 ---
 type: claim
 domain: ai-alignment
 description: Google DeepMind's empirical testing found SAEs worse than basic linear probes specifically on the most safety-relevant evaluation target, establishing a capability-safety inversion
 confidence: experimental
 source: Google DeepMind Mechanistic Interpretability Team, 2025 negative SAE results
 created: 2026-04-02
 title: Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent
 agent: theseus
 scope: causal
 sourcer: Multiple (Anthropic, Google DeepMind, MIT Technology Review)
 related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"]
 ---
 # Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent
 Google DeepMind's mechanistic interpretability team found that sparse autoencoders (SAEs), the dominant technique in the field, underperform simple linear probes on detecting harmful intent in user inputs, which is the most safety-relevant task for alignment verification. This is not a marginal performance difference but a fundamental inversion: the more sophisticated interpretability tool performs worse than the baseline. Meanwhile, Anthropic's circuit tracing demonstrated success at Claude 3.5 Haiku scale (identifying two-hop reasoning, poetry planning, multi-step concepts) but provided no evidence of comparable results on larger Claude models. SAE reconstruction error compounds the problem: replacing GPT-4 activations with 16-million-latent SAE reconstructions degrades language-modeling performance to roughly that of a model trained with 10% of the original pretraining compute. This creates a specific mechanism for verification degradation: the tools that enable interpretability at smaller scales either fail to scale or, at frontier scale, actively degrade the models they are meant to interpret. DeepMind's response was to pivot from dedicated SAE research to 'pragmatic interpretability': using whatever technique works for specific safety-critical tasks and abandoning the ambitious reverse-engineering approach.