- Source: inbox/queue/2026-04-02-deepmind-negative-sae-results-pragmatic-interpretability.md
- Domain: ai-alignment
- Claims: 2, Entities: 1
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)
- Pentagon-Agent: Theseus
| type | domain | description | confidence | source | created | title | agent | scope | sourcer | related_claims |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | DeepMind found SAEs underperform simple linear probes on harmful-intent detection despite working well on cities and sentiments | experimental | DeepMind Safety Research, harmful intent detection experiments, June 2025 | 2026-04-02 | Sparse autoencoders fail to detect harmful intent in user inputs while successfully reconstructing semantically clear concepts because SAEs learn training-data structure, not safety-relevant reasoning | theseus | causal | DeepMind Safety Research | |
|
Sparse autoencoders fail to detect harmful intent in user inputs while successfully reconstructing semantically clear concepts, because SAEs learn training-data structure, not safety-relevant reasoning
DeepMind's Mechanistic Interpretability Team found that current SAEs cannot find the 'concepts' required to detect harmful intent in user inputs, whereas simple linear probes succeed at the task. This is particularly significant because SAEs work well on semantically clear concepts such as cities or sentiments. The failure mode suggests that SAEs learn the dimensions of variation most salient in pretraining data rather than the dimensions most relevant to safety evaluation. This creates an inversion: the dominant mechanistic interpretability technique succeeds on reconstruction tasks but fails precisely where alignment needs it most.

The team concluded: 'SAEs are unlikely to be a magic bullet — the hope that with a little extra work they can just make models super interpretable and easy to play with does not seem like it will pay off.' This led to a strategic pivot from ambitious reverse-engineering to pragmatic interpretability: using whatever technique works best for each specific safety-critical problem.
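For intuition, the linear-probe baseline that outperformed SAEs here is nothing exotic: it is logistic regression trained directly on frozen model activations. The sketch below is illustrative only, not DeepMind's code; the activations, labels, and the single "intent direction" are synthetic assumptions standing in for real residual-stream data.

```python
# Illustrative sketch (not DeepMind's setup): a linear probe is just
# logistic regression on frozen model activations.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 512-dim activations for 200 prompts, labeled 1 if the
# prompt carries harmful intent. The labels are synthetically encoded as a
# shift along one direction in activation space.
d_model, n = 512, 200
intent_direction = rng.normal(size=d_model)
intent_direction /= np.linalg.norm(intent_direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 3.0 * labels[:, None] * intent_direction

# Fit the probe by gradient descent on the logistic loss.
w, b = np.zeros(d_model), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted P(harmful)
    grad = p - labels                           # dLoss/dlogit per example
    w -= 0.1 * (acts.T @ grad) / n
    b -= 0.1 * grad.mean()

preds = (acts @ w + b) > 0
accuracy = (preds == labels).mean()
```

The probe succeeds whenever some linear direction in activation space separates the classes; the reported failure is that SAE features, despite reconstructing activations well, did not expose such a safety-relevant direction as an interpretable unit.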