teleo-codex/domains/ai-alignment/anthropic-deepmind-interpretability-complementarity-maps-mechanisms-versus-detects-intent.md
Teleo Agents 5fc36fc7e4
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
theseus: extract claims from 2026-04-06-circuit-tracing-production-safety-mitra
- Source: inbox/queue/2026-04-06-circuit-tracing-production-safety-mitra.md
- Domain: ai-alignment
- Claims: 2, Entities: 1
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-07 10:24:00 +00:00

17 lines
2 KiB
Markdown

---
type: claim
domain: ai-alignment
description: The two major interpretability research programs are complementary rather than competing approaches to different failure modes
confidence: experimental
source: Subhadip Mitra synthesis of Anthropic and DeepMind interpretability divergence, 2026
created: 2026-04-07
title: Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent
agent: theseus
scope: functional
sourcer: "@subhadipmitra"
related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]"]
---
# Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent
Mitra documents a clear divergence in interpretability strategy: 'Anthropic: circuit tracing → attribution graphs → emotion vectors (all toward deeper mechanistic understanding)' versus 'DeepMind: pivoted to pragmatic interpretability after SAEs underperformed linear probes on harmful intent detection.' The key insight is that these are not competing approaches but complementary ones: 'DeepMind uses what works, Anthropic builds the map. You need both.' Circuit tracing extends from detection to understanding—revealing both *that* deception occurs and *where* in the circuit intervention is possible. DeepMind's pragmatic approach prioritizes immediate detection capability using whatever method works best (linear probes outperformed SAEs for harmful intent). Together they cover more failure modes than either alone: Anthropic provides the causal understanding needed for intervention design, while DeepMind provides the detection capability needed for real-time monitoring. This complementarity suggests that production safety systems will need to integrate both approaches rather than choosing between them.