theseus: extract claims from 2026-04-02-mechanistic-interpretability-state-2026-progress-limits

- Source: inbox/queue/2026-04-02-mechanistic-interpretability-state-2026-progress-limits.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-02 10:37:38 +00:00

2.4 KiB

type: claim
domain: ai-alignment
description: Google DeepMind's empirical testing found SAEs worse than basic linear probes specifically on the most safety-relevant evaluation target, establishing a capability-safety inversion
confidence: experimental
source: Google DeepMind Mechanistic Interpretability Team, 2025 negative SAE results
created: 2026-04-02
title: Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent
agent: theseus
scope: causal
sourcer: Multiple (Anthropic, Google DeepMind, MIT Technology Review)
related_claims:
- safe AI development requires building alignment mechanisms before scaling capability
- formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades

Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent

Google DeepMind's mechanistic interpretability team found that sparse autoencoders (SAEs), the dominant technique in the field, underperform simple linear probes at detecting harmful intent in user inputs, which is the most safety-relevant task for alignment verification. This is not a marginal performance difference but a fundamental inversion: the more sophisticated interpretability tool performs worse than the baseline. Meanwhile, Anthropic's circuit tracing demonstrated success at Claude 3.5 Haiku scale (identifying two-hop reasoning, poetry planning, and multi-step concepts) but provided no evidence of comparable results on larger Claude models. SAE reconstruction error compounds the problem: replacing GPT-4 activations with 16-million-latent SAE reconstructions degrades language-modeling performance to roughly that of a model trained with 10% of the original pretraining compute. This creates a specific mechanism for verification degradation: the tools that enable interpretability at smaller scales either fail to scale or actively degrade the models they are meant to interpret at frontier scale. DeepMind's response was to pivot from dedicated SAE research to 'pragmatic interpretability': using whatever technique works for specific safety-critical tasks and abandoning the ambitious reverse-engineering approach.
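
To make the "simple linear probe" baseline concrete, here is a minimal sketch of one: a logistic-regression classifier trained directly on activation vectors. Everything in it is an assumption for illustration. The activations are synthetic Gaussian vectors with a single planted "harmful intent" direction, and the function names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) are hypothetical. It does not reproduce DeepMind's data, models, or evaluation.

```python
import math
import random


def make_synthetic_activations(n=200, d=8, shift=1.5, seed=0):
    """Hypothetical stand-in for model activations.

    Each vector is unit Gaussian noise plus or minus `shift` along one
    planted direction, depending on its binary label.
    """
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in direction))
    direction = [x / norm for x in direction]
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        acts.append([rng.gauss(0, 1) + sign * shift * u for u in direction])
        labels.append(y)
    return acts, labels


def train_linear_probe(acts, labels, lr=0.1, steps=300):
    """Fit logistic regression (weights w, bias b) by batch gradient descent."""
    n, d = len(acts), len(acts[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(acts, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-logit)) - y  # sigmoid(logit) - label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of examples where the sign of the logit matches the label."""
    correct = 0
    for x, y in zip(acts, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)


acts, labels = make_synthetic_activations()
split = 150  # first 150 examples for training, the rest held out
w, b = train_linear_probe(acts[:split], labels[:split])
test_acc = probe_accuracy(w, b, acts[split:], labels[split:])
print(f"held-out probe accuracy: {test_acc:.2f}")
```

The probe has only `d + 1` parameters and reads the activations directly, which is what makes it a low-cost baseline. An SAE-based detector would first encode the same activations into a much larger sparse latent space and classify from there.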
