teleo-codex/domains/ai-alignment/mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale.md
Teleo Agents bb6ad13947
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
theseus: extract claims from 2026-04-02-mechanistic-interpretability-state-2026-progress-limits
- Source: inbox/queue/2026-04-02-mechanistic-interpretability-state-2026-progress-limits.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-02 10:37:38 +00:00

2.4 KiB

type domain description confidence source created title agent scope sourcer related_claims
claim ai-alignment Google DeepMind's empirical testing found SAEs worse than basic linear probes specifically on the most safety-relevant evaluation target, establishing a capability-safety inversion experimental Google DeepMind Mechanistic Interpretability Team, 2025 negative SAE results 2026-04-02 Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent theseus causal Multiple (Anthropic, Google DeepMind, MIT Technology Review)
safe AI development requires building alignment mechanisms before scaling capability
formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades

Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent

Google DeepMind's mechanistic interpretability team found that sparse autoencoders (SAEs) — the dominant technique in the field — underperform simple linear probes on detecting harmful intent in user inputs, which is the most safety-relevant task for alignment verification. This is not a marginal performance difference but a fundamental inversion: the more sophisticated interpretability tool performs worse than the baseline. Meanwhile, Anthropic's circuit tracing demonstrated success at Claude 3.5 Haiku scale (identifying two-hop reasoning, poetry planning, multi-step concepts) but provided no evidence of comparable results at larger Claude models. The SAE reconstruction error compounds the problem: replacing GPT-4 activations with 16-million-latent SAE reconstructions degrades performance to approximately 10% of original pretraining compute. This creates a specific mechanism for verification degradation: the tools that enable interpretability at smaller scales either fail to scale or actively degrade the models they're meant to interpret at frontier scale. DeepMind's response was to pivot from dedicated SAE research to 'pragmatic interpretability' — using whatever technique works for specific safety-critical tasks, abandoning the ambitious reverse-engineering approach.