theseus: extract claims from 2026-04-02-deepmind-negative-sae-results-pragmatic-interpretability
- Source: inbox/queue/2026-04-02-deepmind-negative-sae-results-pragmatic-interpretability.md
- Domain: ai-alignment
- Claims: 2, Entities: 1, Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
Parent: 6bc5637259
Commit: 3d4f350ab5
3 changed files with 52 additions and 0 deletions
@ -0,0 +1,16 @@
---
type: claim
domain: ai-alignment
description: "Replacing GPT-4 activations with 16-million-latent SAE reconstructions loses 90% of performance"
confidence: experimental
source: DeepMind Safety Research, GPT-4 reconstruction experiments June 2025
created: 2026-04-02
title: SAE reconstruction degrades model performance equivalent to 90 percent of original pretraining compute establishing a fundamental fidelity ceiling for interpretability through reconstruction
agent: theseus
scope: causal
sourcer: DeepMind Safety Research
---

# SAE reconstruction degrades model performance equivalent to 90 percent of original pretraining compute establishing a fundamental fidelity ceiling for interpretability through reconstruction

DeepMind found that replacing GPT-4 activations with reconstructions from a 16-million-latent SAE degrades downstream performance to the level of a model trained with only about 10% of the original pretraining compute, a 90% effective loss from reconstruction alone. This establishes a fundamental ceiling on the fidelity of interpretability approaches based on reconstruction: even with massive SAE capacity (16 million latents), the reconstruction does not preserve the information the model needs to perform. Interpretability through decomposition may therefore face inherent limits, because breaking activations into interpretable components loses information critical to model function. The magnitude of the degradation indicates this is not a scaling problem that larger SAEs will solve, but a fundamental trade-off between interpretability and fidelity.
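The substitution experiment behind this claim can be sketched in a few lines: intercept an activation, round-trip it through the SAE's encoder and decoder, and measure how much signal the round trip destroys. A minimal numpy sketch follows; all weights, shapes, and the variance-normalized error metric are illustrative assumptions, not DeepMind's actual setup (their metric was downstream loss expressed as equivalent pretraining compute).

```python
import numpy as np

def sae_reconstruct(acts, W_enc, b_enc, W_dec, b_dec):
    """Round-trip activations through a sparse autoencoder:
    ReLU encoder -> sparse latents -> linear decoder."""
    latents = np.maximum(acts @ W_enc + b_enc, 0.0)  # sparse codes
    return latents @ W_dec + b_dec

def normalized_reconstruction_error(acts, recon):
    """MSE between original and reconstructed activations, scaled by
    activation variance (0.0 = perfect; near or above 1.0 = useless)."""
    mse = np.mean((acts - recon) ** 2)
    var = np.mean((acts - acts.mean(axis=0)) ** 2)
    return mse / var

rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=(256, 32)))  # toy non-negative activations

# An "identity" SAE round-trips non-negative activations losslessly.
eye, zeros = np.eye(32), np.zeros(32)
perfect = sae_reconstruct(acts, eye, zeros, eye, zeros)
print(normalized_reconstruction_error(acts, perfect))  # 0.0

# A collapsed decoder throws everything away.
collapsed = sae_reconstruct(acts, eye, zeros, np.zeros((32, 32)), zeros)
print(normalized_reconstruction_error(acts, collapsed))  # well above 0.5
```

In a real experiment the reconstruction would be patched back into the model's forward pass and the change in language-modeling loss measured; the point of the sketch is only the encode/decode substitution itself.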
@ -0,0 +1,17 @@
---
type: claim
domain: ai-alignment
description: DeepMind found SAEs underperform simple linear probes on harmful intent detection despite working well on cities and sentiments
confidence: experimental
source: DeepMind Safety Research, harmful intent detection experiments June 2025
created: 2026-04-02
title: Sparse autoencoders fail to detect harmful intent in user inputs while successfully reconstructing semantically clear concepts because SAEs learn training data structure not safety-relevant reasoning
agent: theseus
scope: causal
sourcer: DeepMind Safety Research
related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]"]
---

# Sparse autoencoders fail to detect harmful intent in user inputs while successfully reconstructing semantically clear concepts because SAEs learn training data structure not safety-relevant reasoning

DeepMind's Mechanistic Interpretability Team found that current SAEs cannot find the 'concepts' required to detect harmful intent in user inputs, while simple linear probes succeed at the task. The failure is striking because SAEs work well on semantically clear concepts such as cities or sentiments. It suggests SAEs learn the dimensions of variation most salient in pretraining data rather than the dimensions most relevant to safety evaluation. This creates an inversion: the dominant mechanistic interpretability technique succeeds on reconstruction tasks but fails precisely where alignment needs it most. The team concluded that 'SAEs are unlikely to be a magic bullet — the hope that with a little extra work they can just make models super interpretable and easy to play with does not seem like it will pay off.' This led to a strategic pivot from ambitious reverse-engineering to pragmatic interpretability: using whatever technique works best for specific safety-critical problems.
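The baseline that beat SAEs here is simple: a linear probe is just logistic regression on activation vectors, with labels for the property of interest. A hedged numpy sketch, using synthetic data as a stand-in for real residual-stream activations and a single hypothetical "concept axis" rather than anything measured from a model:

```python
import numpy as np

def train_linear_probe(X, y, lr=0.5, steps=300):
    """Fit a logistic-regression probe on activation vectors X (n, d)
    against binary labels y, via plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def probe_accuracy(X, y, w, b):
    return np.mean(((X @ w + b) > 0).astype(float) == y)

# Synthetic "activations": the label shifts the data along one direction,
# mimicking a linearly represented concept in the residual stream.
rng = np.random.default_rng(0)
n, d = 400, 16
direction = np.zeros(d)
direction[3] = 1.0  # hypothetical concept axis
y = rng.integers(0, 2, size=n).astype(float)
X = rng.normal(size=(n, d)) + 4.0 * y[:, None] * direction

w, b = train_linear_probe(X, y)
print(probe_accuracy(X, y, w, b))  # high accuracy on a linear concept
```

With a real model the probe would be trained on cached activations from labeled prompts; the finding is that this cheap supervised baseline outperforms SAE-latent features for harmful-intent detection, not that probes are themselves interpretable.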
entities/ai-alignment/gemma-scope-2.md (new file, 19 lines)
@ -0,0 +1,19 @@
# Gemma Scope 2
**Type:** Research Program
**Organization:** Google DeepMind
**Domain:** AI Alignment / Mechanistic Interpretability
**Status:** Active (announced June 2025, release December 2025)

## Overview

Gemma Scope 2 is a full-stack interpretability suite for Gemma 3 models ranging from 270M to 27B parameters, producing approximately 110 petabytes of activation data.

## Strategic Context

Gemma Scope 2 was released alongside DeepMind's announcement of negative SAE results and its pivot to pragmatic interpretability. The continued investment in interpretability infrastructure, despite the deprioritization of SAE research specifically, signals that DeepMind remains committed to mechanistic interpretability tooling while becoming more selective about which techniques to pursue.
## Timeline

- **2025-06-01** — Announced in DeepMind's negative SAE results post
- **2025-12-XX** — Planned release date