- Source: inbox/queue/2026-04-02-deepmind-negative-sae-results-pragmatic-interpretability.md
- Domain: ai-alignment
- Claims: 2, Entities: 1
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)
- Pentagon-Agent: Theseus
| type | domain | description | confidence | source | created | title | agent | scope | sourcer | related_claims |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | DeepMind found SAEs underperform simple linear probes on harmful-intent detection despite working well on cities and sentiments | experimental | DeepMind Safety Research, harmful intent detection experiments, June 2025 | 2026-04-02 | Sparse autoencoders fail to detect harmful intent in user inputs while successfully reconstructing semantically clear concepts because SAEs learn training-data structure, not safety-relevant reasoning | theseus | causal | DeepMind Safety Research | |
|
Sparse autoencoders fail to detect harmful intent in user inputs while successfully reconstructing semantically clear concepts, because SAEs learn training-data structure, not safety-relevant reasoning
DeepMind's Mechanistic Interpretability Team found that current SAEs cannot find the 'concepts' required to detect harmful intent in user inputs, whereas simple linear probes succeed at the task. This is particularly significant because SAEs work well on semantically clear concepts such as cities or sentiments. The failure mode suggests that SAEs learn the dimensions of variation most salient in pretraining data rather than the dimensions most relevant to safety evaluation. This creates an inversion: the dominant mechanistic interpretability technique succeeds on reconstruction tasks but fails precisely where alignment needs it most.

The team concluded: 'SAEs are unlikely to be a magic bullet — the hope that with a little extra work they can just make models super interpretable and easy to play with does not seem like it will pay off.' This led to a strategic pivot from ambitious reverse-engineering to pragmatic interpretability: using whatever technique works best for each specific safety-critical problem.
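For intuition, the linear-probe baseline that outperformed SAEs here is nothing exotic: it is logistic regression trained directly on frozen model activations. The sketch below is illustrative only, not DeepMind's code; the activations, labels, and the single "intent direction" are synthetic assumptions standing in for real residual-stream data.

```python
# Illustrative sketch (not DeepMind's setup): a linear probe is just
# logistic regression on frozen model activations.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 512-dim activations for 200 prompts, labeled 1 if the
# prompt carries harmful intent. The labels are synthetically encoded as a
# shift along one direction in activation space.
d_model, n = 512, 200
intent_direction = rng.normal(size=d_model)
intent_direction /= np.linalg.norm(intent_direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 3.0 * labels[:, None] * intent_direction

# Fit the probe by gradient descent on the logistic loss.
w, b = np.zeros(d_model), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted P(harmful)
    grad = p - labels                           # dLoss/dlogit per example
    w -= 0.1 * (acts.T @ grad) / n
    b -= 0.1 * grad.mean()

preds = (acts @ w + b) > 0
accuracy = (preds == labels).mean()
```

The probe succeeds whenever some linear direction in activation space separates the classes; the reported failure is that SAE features, despite reconstructing activations well, did not expose such a safety-relevant direction as an interpretable unit.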