diff --git a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md index 093867dee..eeddd1979 100644 --- a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md +++ b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md @@ -21,6 +21,12 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system. + +### Additional Evidence (extend) +*Source: [[2026-01-00-mechanistic-interpretability-2026-status-report]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5* + +Mechanistic interpretability progress by 2026 demonstrates the bounded nature of technical approaches: interpretability achieved genuine diagnostic capability (Anthropic integrated it into Claude Sonnet 4.5 deployment decisions, attribution graphs trace 25% of prompts), but the field explicitly abandoned the comprehensive alignment vision (Neel Nanda: 'the most ambitious vision...is probably dead'). Interpretability addresses 'is this model doing something dangerous?' but provides no framework for preference diversity (no evidence it can handle 'is this model serving diverse values?') or coordination problems (no evidence it addresses 'are competing models producing safe interaction effects?'). The Google DeepMind pivot to 'pragmatic interpretability' over fundamental understanding confirms that technical tools optimize for detection, not coordination or value alignment. 
This supports the claim that technical approaches cannot solve the coordination problem: interpretability can diagnose individual model failures but cannot coordinate safety across competing labs or ensure diverse values are preserved. + --- Relevant Notes: diff --git a/domains/ai-alignment/alignment-tax-amplified-by-interpretability-compute-costs.md b/domains/ai-alignment/alignment-tax-amplified-by-interpretability-compute-costs.md new file mode 100644 index 000000000..dc0451ece --- /dev/null +++ b/domains/ai-alignment/alignment-tax-amplified-by-interpretability-compute-costs.md @@ -0,0 +1,36 @@ +--- +type: claim +domain: ai-alignment +description: "Mechanistic interpretability infrastructure costs (20 petabytes storage, GPT-3-level compute per model) amplify the alignment tax and create competitive pressure to skip safety analysis" +confidence: likely +source: "Google DeepMind Gemma 2 interpretability analysis (2025-2026), bigsnarfdude compilation" +created: 2026-03-11 +--- + +# The alignment tax is amplified by interpretability compute costs because comprehensive analysis requires infrastructure-scale resources + +Mechanistic interpretability imposes massive computational costs that amplify the alignment tax beyond training-time safety constraints. Interpreting Gemma 2 required 20 petabytes of storage and GPT-3-level compute — infrastructure costs comparable to training the model itself. + +This creates a structural competitive disadvantage for safety-conscious development. Labs that perform comprehensive interpretability analysis bear costs that competitors can skip. The alignment tax operates at two levels: (1) capability degradation from safety constraints during training (10-40% performance loss from SAE reconstructions), and (2) infrastructure costs for post-training interpretability analysis. + +The Google DeepMind strategic pivot demonstrates the market pressure. 
When SAEs underperformed simple linear probes on practical safety tasks, DeepMind deprioritized fundamental SAE research — a rational response to high costs and limited practical utility. The most resource-intensive interpretability approaches are being abandoned in favor of cheaper baselines. + +Circuit discovery compounds the problem: analyzing 25% of prompts required hours of human effort per analysis. This labor cost is non-scalable and creates a coverage gap where only a small fraction of model behavior receives interpretability scrutiny. + +The result is a race-to-the-bottom dynamic where interpretability becomes a luxury good — performed by well-resourced labs on flagship models but skipped in competitive deployment scenarios. + +## Evidence +- Interpreting Gemma 2 required 20 petabytes of storage and GPT-3-level compute +- SAE reconstructions cause 10-40% performance degradation on downstream tasks +- Circuit discovery for 25% of prompts required hours of human effort per analysis +- Google DeepMind pivoted away from fundamental SAE research when costs exceeded practical utility +- Stream algorithm (October 2025) eliminates 97-99% of token interactions, reducing compute requirements + +--- + +Relevant Notes: +- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] — interpretability adds infrastructure costs to the existing capability tax +- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — interpretability costs create unilateral disadvantage + +Topics: +- [[domains/ai-alignment/_map]] diff --git a/domains/ai-alignment/anthropic-integrated-interpretability-into-production-deployment-decisions.md b/domains/ai-alignment/anthropic-integrated-interpretability-into-production-deployment-decisions.md new file mode 100644 index 000000000..14499d0af --- 
/dev/null +++ b/domains/ai-alignment/anthropic-integrated-interpretability-into-production-deployment-decisions.md @@ -0,0 +1,34 @@ +--- +type: claim +domain: ai-alignment +description: "Anthropic integrated mechanistic interpretability into Claude Sonnet 4.5 pre-deployment safety assessment, marking the first production deployment decision informed by interpretability analysis" +confidence: proven +source: "Anthropic Claude Sonnet 4.5 deployment (2025), bigsnarfdude compilation" +created: 2026-03-11 +--- + +# Anthropic integrated mechanistic interpretability into production deployment decisions with Claude Sonnet 4.5 pre-deployment safety assessment + +Anthropic used mechanistic interpretability in the pre-deployment safety assessment of Claude Sonnet 4.5, marking the first integration of interpretability analysis into production deployment decisions. This represents a transition from research tool to operational safety infrastructure. + +The significance is not the interpretability capability itself (attribution graphs, SAE analysis) but the organizational integration: interpretability analysis became a gate in the deployment pipeline, not a post-hoc research exercise. This creates institutional commitment — deployment decisions now depend on interpretability results. + +Anthropic's strategic direction is "reliably detecting most model problems by 2027" through comprehensive diagnostic coverage (MRI approach). This positions interpretability as a detection layer rather than a comprehensive understanding framework. + +The practical implementation remains limited: attribution graphs trace computational paths for approximately 25% of prompts, and circuit discovery requires hours of human effort per analysis. But the organizational precedent is established — interpretability is now part of the deployment decision process at a frontier lab. 
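The note's claim is organizational, not algorithmic: interpretability results block or permit release. A minimal sketch of that gate pattern, with every name and threshold invented for illustration (this is not Anthropic's actual pipeline):

```python
from dataclasses import dataclass, field

# Hypothetical illustration only: names and thresholds are invented,
# not Anthropic's actual deployment pipeline.

@dataclass
class InterpretabilityReport:
    prompts_traced_fraction: float      # e.g. attribution-graph coverage
    flagged_behaviors: list = field(default_factory=list)

def deployment_gate(report: InterpretabilityReport,
                    min_coverage: float = 0.25) -> bool:
    """Return True only if the model may ship; a coverage shortfall or
    any flagged behavior blocks release rather than merely annotating it."""
    if report.prompts_traced_fraction < min_coverage:
        return False  # insufficient diagnostic coverage blocks release
    if report.flagged_behaviors:
        return False  # any flagged behavior blocks release
    return True

clean = deployment_gate(InterpretabilityReport(0.25))
flagged = deployment_gate(InterpretabilityReport(0.25, ["misaligned persona"]))
```

The point of the sketch is the control flow: a gate returns a decision the pipeline must obey, which is what distinguishes operational integration from a post-hoc research exercise.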
+ +## Evidence +- Anthropic used mechanistic interpretability in pre-deployment safety assessment of Claude Sonnet 4.5 +- First integration of interpretability into production deployment decisions +- Anthropic targets "reliably detecting most model problems by 2027" +- Attribution graphs (March 2025) trace computational paths for ~25% of prompts +- Circuit discovery for 25% of prompts required hours of human effort per analysis + +--- + +Relevant Notes: +- [[mechanistic-interpretability-diagnostic-capability-proven-but-comprehensive-alignment-vision-abandoned]] — Anthropic's MRI approach is diagnostic, not comprehensive understanding +- [[safe AI development requires building alignment mechanisms before scaling capability]] — interpretability as deployment gate implements this principle + +Topics: +- [[domains/ai-alignment/_map]] diff --git a/domains/ai-alignment/google-deepmind-pivot-from-saes-signals-practical-utility-failure.md b/domains/ai-alignment/google-deepmind-pivot-from-saes-signals-practical-utility-failure.md new file mode 100644 index 000000000..95e55f72d --- /dev/null +++ b/domains/ai-alignment/google-deepmind-pivot-from-saes-signals-practical-utility-failure.md @@ -0,0 +1,36 @@ +--- +type: claim +domain: ai-alignment +description: "Google DeepMind deprioritized SAE research when SAEs underperformed simple linear probes on practical safety tasks, signaling that sophisticated interpretability methods fail on utility grounds" +confidence: likely +source: "Google DeepMind strategic pivot (2025-2026), bigsnarfdude compilation" +created: 2026-03-11 +--- + +# Google DeepMind's pivot away from SAEs signals that sophisticated interpretability underperforms simple baselines on practical safety tasks + +Google DeepMind's strategic pivot away from fundamental SAE (Sparse Autoencoder) research represents a critical market signal: the leading interpretability lab deprioritized its core technique because SAEs underperformed simple linear probes on practical safety 
tasks. + +This is not a capability failure — DeepMind built Gemma Scope 2, the largest open-source interpretability infrastructure (270M to 27B parameter models), and scaled SAEs to GPT-4 with 16 million latent variables. The technical capability exists. The pivot occurred because sophisticated interpretability methods delivered less practical safety utility than simpler alternatives. + +The practical utility gap is the central tension: simple baseline methods outperform sophisticated interpretability approaches on safety-relevant detection tasks. When the most resource-intensive methods underperform cheap baselines, rational labs shift resources toward pragmatic approaches. + +DeepMind's new direction is "pragmatic interpretability" — task-specific utility over fundamental understanding. This represents a philosophical shift from "understand the model comprehensively" to "detect specific safety-relevant behaviors efficiently." + +The market dynamics are clear: if the lab with the most interpretability expertise and resources concludes that SAEs are not the path to practical safety, other labs will follow. The field is converging on diagnostic tools (Anthropic's MRI approach) rather than comprehensive mechanistic understanding. 
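For readers unfamiliar with the baseline being compared against: a "linear probe" is just a linear classifier trained on frozen activations. A toy sketch on synthetic data (the activation dimensionality and the "safety-relevant" direction are invented for illustration):

```python
import math
import random

random.seed(0)
DIM = 8  # stand-in for a hidden-state dimension

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def make_example(label):
    # Synthetic "activations": unsafe examples (label 1) are shifted along
    # one assumed safety-relevant direction; the rest is noise.
    x = [random.gauss(0, 1) for _ in range(DIM)]
    if label == 1:
        x[0] += 2.0
    return x, label

data = [make_example(i % 2) for i in range(200)]

# The probe itself: logistic regression via plain SGD on the frozen features.
w, b, lr = [0.0] * DIM, 0.0, 0.1
for _ in range(300):
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = p - y
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

accuracy = sum(
    (sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5) == (y == 1)
    for x, y in data
) / len(data)
```

Nothing here requires feature dictionaries or reconstruction: the probe reads the relevant direction directly, which is why it is cheap, and why its outperforming SAEs on detection tasks was such a damaging signal for the sophisticated methods.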
+ +## Evidence +- Google DeepMind found SAEs underperformed simple linear probes on practical safety tasks +- DeepMind pivoted to "pragmatic interpretability" prioritizing task-specific utility over fundamental understanding +- Gemma Scope 2 (December 2025): largest open-source interpretability infrastructure, 270M to 27B parameter models +- SAEs scaled to GPT-4 with 16 million latent variables +- SAE reconstructions cause 10-40% performance degradation on downstream tasks + +--- + +Relevant Notes: +- [[mechanistic-interpretability-diagnostic-capability-proven-but-comprehensive-alignment-vision-abandoned]] — DeepMind pivot is part of broader field convergence +- [[alignment-tax-amplified-by-interpretability-compute-costs]] — high costs with limited utility drove the strategic shift + +Topics: +- [[domains/ai-alignment/_map]] diff --git a/domains/ai-alignment/mechanistic-interpretability-diagnostic-capability-proven-but-comprehensive-alignment-vision-abandoned.md b/domains/ai-alignment/mechanistic-interpretability-diagnostic-capability-proven-but-comprehensive-alignment-vision-abandoned.md new file mode 100644 index 000000000..1a9c92d92 --- /dev/null +++ b/domains/ai-alignment/mechanistic-interpretability-diagnostic-capability-proven-but-comprehensive-alignment-vision-abandoned.md @@ -0,0 +1,43 @@ +--- +type: claim +domain: ai-alignment +description: "Mechanistic interpretability achieved diagnostic capability by 2026 but the comprehensive alignment-through-understanding vision was explicitly abandoned by leading labs" +confidence: likely +source: "Neel Nanda quoted in bigsnarfdude compilation; Google DeepMind strategic pivot; Anthropic deployment integration (2025-2026)" +created: 2026-03-11 +--- + +# Mechanistic interpretability achieved diagnostic capability but the comprehensive alignment vision is acknowledged as probably dead + +By early 2026, the mechanistic interpretability field reached a strategic inflection point: genuine progress on diagnostic 
capabilities combined with explicit abandonment of the most ambitious alignment vision. Neel Nanda's assessment captures the field's consensus: "the most ambitious vision...is probably dead", while medium-risk approaches remain viable. + +The diagnostic capability is real and deployed. Anthropic integrated mechanistic interpretability into pre-deployment safety assessment of Claude Sonnet 4.5 — the first production deployment decision informed by interpretability analysis. Attribution graphs can trace computational paths for approximately 25% of prompts. OpenAI identified "misaligned persona" features detectable via SAEs, and fine-tuning misalignment could be reversed with roughly 100 corrective training samples. + +But the comprehensive vision failed on practical grounds. Google DeepMind's strategic pivot away from fundamental SAE research is the strongest signal: SAEs underperformed simple linear probes on practical safety tasks, causing the leading interpretability lab to deprioritize its core technique. SAE reconstructions cause 10-40% performance degradation on downstream tasks. The practical utility gap — simple baseline methods outperforming sophisticated interpretability approaches on safety-relevant detection — remains the central unresolved tension. + +The theoretical barriers are equally severe. No rigorous definition of "feature" exists. Deep networks exhibit chaotic dynamics where steering vectors become unpredictable after O(log(1/ε)) layers. Many circuit-finding queries are proven NP-hard and inapproximable. Interpreting Gemma 2 required 20 petabytes of storage and GPT-3-level compute — costs that amplify the alignment tax. + +The field diverged into two camps: Anthropic targets "reliably detecting most model problems by 2027" through comprehensive diagnostic coverage (MRI approach), while Google DeepMind pivoted to "pragmatic interpretability" prioritizing task-specific utility over fundamental understanding. 
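The degradation mechanism can be illustrated with a toy stand-in: below, a crude top-k sparsification plays the role of a lossy SAE reconstruction (real SAEs encode and decode through a learned dictionary; all data here is synthetic):

```python
import random

random.seed(1)
DIM = 16

def readout(x, w):
    # Stand-in for downstream computation consuming the activation.
    return sum(wi * xi for wi, xi in zip(w, x))

w = [random.gauss(0, 1) for _ in range(DIM)]           # fixed downstream direction
xs = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(500)]
labels = [1 if readout(x, w) > 0 else 0 for x in xs]   # behavior on exact activations

def reconstruct(x, k):
    # Crude lossy "reconstruction": keep the k largest-magnitude coordinates.
    keep = sorted(range(DIM), key=lambda i: -abs(x[i]))[:k]
    return [x[i] if i in keep else 0.0 for i in range(DIM)]

def agreement(k):
    preds = [1 if readout(reconstruct(x, k), w) > 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, labels)) / len(xs)

lossless = agreement(DIM)  # identical activations give perfect agreement
lossy = agreement(4)       # aggressive sparsification flips some decisions
```

Running downstream computation on reconstructions rather than true activations changes decisions in proportion to the information the reconstruction discards; this is the toy analogue of the 10-40% degradation figure.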
+ +## Evidence +- MIT Technology Review named mechanistic interpretability a "2026 breakthrough technology" +- January 2025 consensus paper by 29 researchers across 18 organizations established core open problems +- Google DeepMind's Gemma Scope 2 (December 2025): largest open-source interpretability infrastructure, 270M to 27B parameter models +- SAEs scaled to GPT-4 with 16 million latent variables +- Anthropic's attribution graphs (March 2025) trace computational paths for ~25% of prompts +- Stream algorithm (October 2025): near-linear time attention analysis, eliminating 97-99% of token interactions +- Circuit discovery for 25% of prompts required hours of human effort per analysis +- Feature manifolds: SAEs may learn far fewer distinct features than latent counts suggest + +## Challenges +The practical utility gap persists: baselines outperform sophisticated methods on safety tasks. This suggests interpretability may be solving the wrong problem — optimizing for mechanistic understanding rather than safety-relevant detection. 
+ +--- + +Relevant Notes: +- [[AI alignment is a coordination problem not a technical problem]] — interpretability progress is real but bounded to diagnostic capability +- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — NP-hardness results and practical utility gap confirm oversight degradation +- [[safe AI development requires building alignment mechanisms before scaling capability]] — interpretability as diagnostic enables this but cannot guarantee it + +Topics: +- [[domains/ai-alignment/_map]] diff --git a/inbox/archive/2026-01-00-mechanistic-interpretability-2026-status-report.md b/inbox/archive/2026-01-00-mechanistic-interpretability-2026-status-report.md index f6fcabba1..d86798abf 100644 --- a/inbox/archive/2026-01-00-mechanistic-interpretability-2026-status-report.md +++ b/inbox/archive/2026-01-00-mechanistic-interpretability-2026-status-report.md @@ -7,9 +7,15 @@ date: 2026-01-01 domain: ai-alignment secondary_domains: [] format: report -status: unprocessed +status: processed priority: high tags: [mechanistic-interpretability, SAE, safety, technical-alignment, limitations, DeepMind-pivot] +processed_by: theseus +processed_date: 2026-03-11 +claims_extracted: ["mechanistic-interpretability-diagnostic-capability-proven-but-comprehensive-alignment-vision-abandoned.md", "alignment-tax-amplified-by-interpretability-compute-costs.md", "google-deepmind-pivot-from-saes-signals-practical-utility-failure.md", "anthropic-integrated-interpretability-into-production-deployment-decisions.md"] +enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md"] +extraction_model: "anthropic/claude-sonnet-4.5" +extraction_notes: "Source is a compilation from multiple primary sources (Anthropic, Google DeepMind, OpenAI, consensus paper). Four claims extracted focusing on: (1) diagnostic capability vs. 
comprehensive alignment vision divergence, (2) interpretability compute costs as alignment tax amplifier, (3) DeepMind strategic pivot as market signal, (4) Anthropic production deployment integration. Three enrichments applied to existing alignment claims. Key insight: interpretability is real progress on diagnostics but explicitly not a path to comprehensive alignment — supports Leo's coordination framing while granting more ground to technical approaches than previously acknowledged." --- ## Content @@ -64,3 +70,14 @@ Comprehensive status report on mechanistic interpretability as of early 2026: PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] WHY ARCHIVED: Provides 2026 status evidence on whether technical alignment (interpretability) can close the alignment gap — answer is "useful but bounded" EXTRACTION HINT: Focus on the practical utility gap (baselines outperform SAEs on safety tasks), the DeepMind strategic pivot, and Anthropic's production deployment use. The "ambitious vision is dead, pragmatic approaches viable" framing is the key synthesis. + + +## Key Facts +- MIT Technology Review named mechanistic interpretability a '2026 breakthrough technology' +- January 2025 consensus paper by 29 researchers across 18 organizations established core open problems +- Google DeepMind's Gemma Scope 2 (December 2025): 270M to 27B parameter models +- SAEs scaled to GPT-4 with 16 million latent variables +- Anthropic's attribution graphs (March 2025) trace ~25% of prompts +- Stream algorithm (October 2025): eliminates 97-99% of token interactions +- OpenAI identified 'misaligned persona' features detectable via SAEs +- Fine-tuning misalignment reversible with ~100 corrective training samples