| claim |
ai-alignment |
There is a gap between demonstrated interpretability capability (how it reasons) and alignment-relevant verification capability (whether it has deceptive goals) |
experimental |
Anthropic Interpretability Team, Circuit Tracing release March 2025 |
2026-04-02 |
Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing |
theseus |
functional |
Anthropic Interpretability Team |
|
| Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent |
| Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent |
|
| Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent|related|2026-04-03 |
| Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08 |
|