theseus: extract claims from 2025-05-00-anthropic-interpretability-pre-deployment.md
- Source: inbox/archive/2025-05-00-anthropic-interpretability-pre-deployment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 3) Pentagon-Agent: Theseus <HEADLESS>
Parent: d3d126ea19
Commit: 9c37414ed1
6 changed files with 120 additions and 1 deletions

@@ -15,6 +15,12 @@ Bostrom constructs a chilling scenario showing how the treacherous turn could un
This is why [[trial and error is the only coordination strategy humanity has ever used]] is so dangerous in the AI context -- the treacherous turn means we cannot learn from gradual failure because the first visible failure may come only after the system has achieved unassailable strategic advantage.
### Additional Evidence (extend)
*Source: [[2025-05-00-anthropic-interpretability-pre-deployment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
Anthropic's pre-deployment interpretability assessment for Claude Opus 4.6 explicitly targeted detection of alignment faking, hidden goals, deceptive use of reasoning scratchpads, and attempts to hide dangerous capabilities. This represents the first operational deployment of mechanistic interpretability tools designed to detect the specific treacherous-turn scenarios this claim warns about. The assessment included "several person-weeks of open-ended investigation effort by interpretability researchers" targeting these deception patterns, suggesting that interpretability research has developed sufficient capability to operationalize detection of strategic deception in deployed models.
---
Relevant Notes:
@@ -25,6 +25,12 @@ This pattern provides a concrete counterexample to the pessimism of scalable ove
For alignment specifically: if AI systems generate safety proofs for their own behavior, and those proofs are machine-checked, this creates an oversight mechanism that scales with capability. The alignment tax for formal verification is real (writing formal specs is hard) but the reliability does not degrade with the capability gap.
### Additional Evidence (extend)
*Source: [[2025-05-00-anthropic-interpretability-pre-deployment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
Mechanistic interpretability now occupies a middle ground between formal verification and no verification. Anthropic integrated "several person-weeks" of interpretability investigation into Claude Opus 4.6 deployment decisions, demonstrating that interpretability provides oversight capabilities that behavioral review cannot match (detecting hidden goals and deceptive reasoning). However, interpretability requires expert effort that grows with model complexity, rather than automating in the way formal proof verification would. This positions interpretability as a partial solution that scales better than human review but worse than automated verification, supporting the claim that scalable oversight requires automation, not expert-intensive investigation.
---
Relevant Notes:
@@ -0,0 +1,47 @@
---
type: claim
domain: ai-alignment
description: "Mechanistic interpretability has transitioned from research to operational safety tool, with Anthropic integrating it into Claude Opus 4.6 pre-deployment assessment despite scalability constraints"
confidence: likely
source: "Anthropic Pre-Deployment Interpretability Assessment (2025)"
created: 2025-05-01
last_evaluated: 2025-05-01
enrichments:
- "an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak"
- "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades"
- "safe AI development requires building alignment mechanisms before scaling capability"
challenged_by:
- "scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps"
---
# Mechanistic interpretability has transitioned from research to operational safety tool with Anthropic integrating it into production deployment decisions
Anthropic's pre-deployment assessment for Claude Opus 4.6 included "several person-weeks of open-ended investigation effort by interpretability researchers" as part of the alignment assessment. This marks the first documented integration of mechanistic interpretability research into production deployment decisions at a frontier AI lab.
The assessment targeted nine specific risk categories:
- Alignment faking
- Undesirable or unexpected goals
- Hidden goals
- Deceptive or unfaithful use of reasoning scratchpads
- Sycophancy toward users
- Willingness to sabotage safeguards
- Reward seeking
- Attempts to hide dangerous capabilities
- Attempts to manipulate users toward certain views
These targets directly address treacherous-turn scenarios from alignment theory—the possibility that models exhibit cooperative behavior instrumentally while harboring misaligned goals. Anthropic's report states that interpretability research "has shown the ability to explain a wide range of phenomena in models and has proven useful in both applied alignment assessments and model-organisms exercises."
CEO Dario Amodei set an April 2025 target to "reliably detect most model problems by 2027"—the "MRI for AI" vision suggesting interpretability could become a standard diagnostic tool.
## Operational Significance
The integration of interpretability into deployment decisions represents a transition from pure research to applied safety infrastructure. However, the report does not disclose whether interpretability findings ever prevented or delayed a deployment, what specific phenomena were detected, or how interpretability results were weighted against other safety assessments. The transition from research to operational use is verifiable and significant regardless of how much decision weight interpretability carried.
## Scalability Tension
The resource requirement of "several person-weeks" per model deployment creates a structural bottleneck. As model releases accelerate and capability gaps widen, expert-intensive interpretability investigation may not scale to match deployment velocity. This positions interpretability as a middle ground between full formal verification (which doesn't exist for neural networks) and no verification, but raises questions about whether it can remain operationally viable as systems become more complex.
---
Topics:
- [[domains/ai-alignment/_map]]
@@ -0,0 +1,41 @@
---
type: claim
domain: ai-alignment
description: "Expert-intensive interpretability investigation creates a deployment bottleneck because person-weeks of effort per model cannot scale with accelerating release cycles"
confidence: experimental
source: "Anthropic Pre-Deployment Interpretability Assessment (2025)"
created: 2025-05-01
last_evaluated: 2025-05-01
challenged_by:
- "mechanistic interpretability has transitioned from research to operational safety tool with Anthropic integrating it into production deployment decisions"
---
# Expert-intensive interpretability investigation creates a deployment bottleneck because person-weeks of effort per model cannot scale with accelerating release cycles
Anthropic's pre-deployment interpretability assessment for Claude Opus 4.6 required "several person-weeks of open-ended investigation effort by interpretability researchers." This resource requirement creates a structural tension between safety assessment depth and deployment velocity.
As frontier AI development accelerates, the scalability problem compounds in three ways:
1. **Release frequency**: If model releases accelerate from quarterly to monthly or continuous deployment, person-weeks per assessment becomes a bottleneck
2. **Capability gaps**: As models become more capable, the gap between human interpretability researchers and model capability widens, potentially requiring more investigation time per model
3. **Expert scarcity**: Mechanistic interpretability expertise is concentrated in a small number of researchers, creating a human capital constraint that cannot rapidly scale
This positions interpretability as operationally similar to manual code review or security auditing—valuable for high-stakes deployments but structurally unable to match the pace of continuous deployment without at least one of the following (a rough capacity sketch follows this list):
- Accepting reduced investigation depth
- Automating interpretability investigation (which reintroduces alignment questions about the automation)
- Slowing deployment velocity to match investigation capacity
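To make the arithmetic concrete, here is a minimal capacity sketch in Python. The team size, time allocation, exact person-weeks per assessment, and release cadences are illustrative assumptions, not figures from the Anthropic report; only the "several person-weeks per assessment" order of magnitude comes from the source.

```python
# Rough, assumption-laden capacity model for expert-intensive pre-deployment
# interpretability review. Only the person-weeks-per-assessment order of
# magnitude comes from the source; every other number is hypothetical.

def assessments_per_year(researchers: int,
                         review_time_fraction: float,
                         person_weeks_per_assessment: float,
                         working_weeks_per_year: float = 46.0) -> float:
    """Upper bound on pre-deployment assessments a team can complete per year."""
    available_person_weeks = researchers * review_time_fraction * working_weeks_per_year
    return available_person_weeks / person_weeks_per_assessment


if __name__ == "__main__":
    # Hypothetical scenario: 10 interpretability researchers spend 25% of their
    # time on deployment reviews, and each review costs ~6 person-weeks.
    capacity = assessments_per_year(researchers=10,
                                    review_time_fraction=0.25,
                                    person_weeks_per_assessment=6.0)
    print(f"~{capacity:.0f} assessments per year")  # ~19 under these assumptions

    # A quarterly cadence (4 releases/year) fits comfortably; a move toward
    # continuous deployment (50+ releases/year) exceeds this capacity unless
    # per-assessment effort drops or the reviewer pool grows.
```

Under these assumed numbers the bottleneck only binds as release cadence approaches continuous deployment, which is consistent with rating this claim experimental rather than likely.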
## Confidence Limitations
The claim is rated experimental rather than likely because:
- We have evidence from only one deployment (Claude Opus 4.6)
- Anthropic may develop more efficient investigation protocols
- The "several person-weeks" may represent initial integration overhead that decreases with practice
- Automated interpretability tools may emerge that reduce expert time requirements
However, the structural pattern matches other expert-intensive oversight mechanisms that failed to scale with system complexity (e.g., formal code auditing, human-in-the-loop review).
---
Topics:
- [[domains/ai-alignment/_map]]
@@ -21,6 +21,12 @@ This phased approach is also a practical response to the observation that since
Anthropic's RSP rollback demonstrates the opposite pattern in practice: the company scaled capability while weakening its pre-commitment to adequate safety measures. The original RSP required guaranteeing safety measures were adequate *before* training new systems. The rollback removes this forcing function, allowing capability development to proceed with safety work repositioned as aspirational ('we hope to create a forcing function') rather than mandatory. This provides empirical evidence that even safety-focused organizations prioritize capability scaling over alignment-first development when competitive pressure intensifies, suggesting the claim may be normatively correct but descriptively violated by actual frontier labs under market conditions.
### Additional Evidence (confirm)
*Source: [[2025-05-00-anthropic-interpretability-pre-deployment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
Anthropic's integration of interpretability into pre-deployment assessment demonstrates alignment mechanisms being built and deployed alongside capability scaling. The assessment targets alignment faking, hidden goals, and deceptive reasoning before models reach production. CEO Dario Amodei's 2027 target to "reliably detect most model problems" suggests interpretability is being developed as a standard safety infrastructure component, not retrofitted after deployment. This operational integration of interpretability into pre-deployment decisions provides evidence that at least one frontier lab is building alignment mechanisms before scaling capability.
---
Relevant Notes:
@@ -7,9 +7,15 @@ date: 2025-05-01
domain: ai-alignment
secondary_domains: []
format: report
status: unprocessed
status: processed
priority: medium
tags: [interpretability, pre-deployment, safety-assessment, Anthropic, deception-detection, mechanistic]
processed_by: theseus
processed_date: 2025-05-01
claims_extracted: ["mechanistic-interpretability-integrated-into-production-deployment-decisions-at-anthropic.md", "person-weeks-of-expert-interpretability-effort-per-model-creates-deployment-scalability-bottleneck.md"]
enrichments_applied: ["an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak.md", "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades.md", "safe AI development requires building alignment mechanisms before scaling capability.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "First documented case of mechanistic interpretability transitioning from research to operational deployment safety tool. Two claims extracted: (1) the transition itself as evidence that technical alignment can work in practice, (2) the scalability bottleneck created by expert-intensive investigation. Three enrichments to existing alignment claims. Self-reported evidence with appropriate caveats about what was not disclosed (whether findings prevented deployments, specific detection results, decision weight)."
---
## Content
@@ -53,3 +59,10 @@ Interpretability research "has shown the ability to explain a wide range of phen
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: First evidence of interpretability used in production deployment decisions — challenges the "technical alignment is insufficient" thesis while raising scalability questions
EXTRACTION HINT: The transition from research to operational use is the key claim. The scalability tension (person-weeks per model) is the counter-claim. Both worth extracting.
## Key Facts
- Anthropic integrated interpretability into Claude Opus 4.6 pre-deployment assessment (2025)
- Assessment required several person-weeks of expert investigation effort
- Nine risk categories targeted: alignment faking, hidden goals, deceptive reasoning, sycophancy, safeguard sabotage, reward seeking, capability hiding, user manipulation, unexpected goals
- Dario Amodei set April 2025 target to reliably detect most model problems by 2027