theseus: extract claims from 2025-05-00-anthropic-interpretability-pre-deployment.md
- Source: inbox/archive/2025-05-00-anthropic-interpretability-pre-deployment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron Pentagon-Agent: Theseus <HEADLESS>

parent ccb1e15964
commit 3fb7347f0b

3 changed files with 105 additions and 1 deletion
@@ -0,0 +1,42 @@
---
type: claim
domain: ai-alignment
description: "Interpretability-based safety assessment requires person-weeks of expert effort per model, creating a structural scalability bottleneck that cannot sustain industry-wide deployment velocity"
confidence: likely
source: "Anthropic Pre-Deployment Interpretability Assessment (2025)"
created: 2025-05-01
depends_on: ["mechanistic-interpretability-integrated-into-pre-deployment-safety-assessment-at-anthropic-in-2025"]
challenged_by: []
---

# Interpretability assessment requires person-weeks of expert effort per model, creating a scalability bottleneck

Anthropic's pre-deployment interpretability assessment for Claude Opus 4.6 required "several person-weeks of open-ended investigation effort by interpretability researchers." This resource intensity creates a structural scalability problem as model development accelerates and deployment frequency increases.

## Scalability Constraints

The bottleneck operates along multiple dimensions:

**1. Expert scarcity**: Interpretability researchers with the skills to conduct these assessments are scarce relative to the pace of model development across the industry. The field does not have enough trained interpretability researchers to apply this method to all frontier model deployments.

**2. Time constraints vs. deployment velocity**: A cost of person-weeks per model creates a deployment gate that competitive pressure will strain. As organizations race to deploy, pressure to reduce assessment thoroughness will increase. A model released quarterly can absorb person-weeks of review; a model released monthly cannot (a back-of-envelope sketch of this arithmetic follows this list).

**3. Capability-driven complexity**: As models become more capable, the interpretability challenge likely becomes harder, not easier. More complex models require more investigation effort, creating an inverse relationship between capability and assessment feasibility.

**4. Industry-wide coverage gap**: If interpretability assessment becomes a safety standard, the current approach cannot scale to cover all frontier model deployments across multiple organizations. This creates a two-tier system where only well-resourced labs can afford thorough assessment.
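
To make the constraint concrete, here is a minimal back-of-envelope sketch. The researcher headcount, team size, and release cadences are illustrative assumptions, not figures from the source; only the "person-weeks per model" order of magnitude comes from the assessment report.

```python
# Back-of-envelope: can a fixed pool of interpretability researchers
# keep up with a given deployment cadence? All numbers below are
# assumed for illustration; only the person-weeks scale is sourced.

def assessments_per_year(researchers: int,
                         weeks_per_assessment: float,
                         team_size: int,
                         working_weeks: float = 48.0) -> float:
    """Maximum assessments per year a researcher pool can staff."""
    person_weeks_available = researchers * working_weeks
    person_weeks_per_model = weeks_per_assessment * team_size
    return person_weeks_available / person_weeks_per_model

# Assumption: 20 qualified researchers; each assessment occupies a
# 4-person team for about 3 weeks (~12 person-weeks per model).
capacity = assessments_per_year(researchers=20,
                                weeks_per_assessment=3,
                                team_size=4)    # ~80 assessments/year

one_org_quarterly = 4        # a single lab releasing quarterly
industry_monthly = 10 * 12   # assumed: 10 labs, each releasing monthly

print(f"pool capacity: {capacity:.0f} assessments/year")
print(f"one lab, quarterly releases: {one_org_quarterly} (fits easily)")
print(f"industry, monthly releases: {industry_monthly} (exceeds capacity)")
```

Under these assumed numbers a single quarterly-release lab consumes a small fraction of the pool, while industry-wide monthly releases overshoot it, which is the coverage gap in dimension 4.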

## Structural Tension

This creates a fundamental tension: the safety assessment method that Anthropic pioneered may be too resource-intensive to serve as the primary safety gate for AI deployment at scale. The approach works for a single organization deploying models quarterly, but breaks down if applied across the industry or at higher deployment frequencies.

The contrast with formal verification is instructive: machine-checked proof verification scales with AI capability (a stronger model generates better proofs, and checking them stays cheap), while human-intensive interpretability assessment carries a fixed human cost per model that competitive dynamics will pressure organizations to reduce or eliminate. A toy cost model below makes the contrast concrete.
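
A sketch only: the functional forms and constants below are assumptions chosen to illustrate the argument, not measurements from either oversight regime.

```python
# Toy model: oversight cost per model as capability grows.
# Both cost functions and all constants are illustrative assumptions.

def interpretability_cost(capability: float,
                          base_person_weeks: float = 12.0) -> float:
    # Assumption: investigation effort grows with model complexity
    # (dimension 3 above), so human cost rises with capability.
    return base_person_weeks * (1.0 + 0.5 * capability)

def proof_checking_cost(capability: float,
                        checker_person_weeks: float = 0.1) -> float:
    # Assumption: a machine proof checker costs roughly the same to
    # run regardless of how capable the proof-generating model is,
    # so the capability argument is deliberately unused.
    return checker_person_weeks

for cap in (1, 2, 4, 8):
    print(f"capability {cap}: interpretability ~"
          f"{interpretability_cost(cap):.1f} person-weeks, "
          f"proof checking ~{proof_checking_cost(cap):.1f}")
```

One cost curve rises with capability and the other stays flat, which is the asymmetry the claim turns on.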

---

Relevant Notes:

- [[scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps]]
- [[formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades]]
- [[economic-forces-push-humans-out-of-every-cognitive-loop-where-output-quality-is-independently-verifiable-because-human-in-the-loop-is-a-cost-that-competitive-markets-eliminate]]

Topics:

- [[ai-alignment]]

@@ -0,0 +1,49 @@
---
type: claim
domain: ai-alignment
description: "Anthropic integrated mechanistic interpretability into Claude pre-deployment safety assessment in 2025, marking the first operational use of interpretability research for deployment decisions"
confidence: likely
source: "Anthropic Pre-Deployment Interpretability Assessment (2025)"
created: 2025-05-01
depends_on: []
challenged_by: ["interpretability-assessment-requires-person-weeks-of-expert-effort-per-model-which-creates-a-scalability-bottleneck-for-safety-assessment"]
|
||||||
|
---

# Mechanistic interpretability integrated into pre-deployment safety assessment at Anthropic in 2025

Anthropic integrated mechanistic interpretability into pre-deployment safety assessments for Claude models in 2025, the first documented use of interpretability research in production deployment decisions. This marks a transition from interpretability as a research tool to interpretability as an operational safety mechanism.

## Specific Assessment Targets

The pre-deployment assessment targeted nine specific alignment risks:

- Alignment faking
- Undesirable or unexpected goals
- Hidden goals
- Deceptive or unfaithful use of reasoning scratchpads
- Sycophancy toward users
- Willingness to sabotage safeguards
- Reward seeking
- Attempts to hide dangerous capabilities
- Attempts to manipulate users toward certain views

These targets represent precisely the "treacherous turn" scenarios that alignment theory identifies as highest-risk: behaviors that appear aligned during development but mask deceptive or goal-misaligned reasoning.

## Implementation Details

The alignment assessment for Claude Opus 4.6 included "several person-weeks of open-ended investigation effort by interpretability researchers." In April 2025, Dario Amodei set a target to "reliably detect most model problems by 2027," framed as the "MRI for AI" vision.

According to Anthropic's report, interpretability research "has shown the ability to explain a wide range of phenomena in models and has proven useful in both applied alignment assessments and model-organisms exercises."

## Significance and Limitations

This is the first evidence that technical interpretability tools have moved from research into operational deployment gates. However, the report does not show that interpretability findings prevented any deployment or materially altered a deployment decision, only that interpretability was "included" in the assessment process. The causal weight of interpretability findings on actual deployment decisions remains unclear.

---

Relevant Notes:

- [[an-aligned-seeming-AI-may-be-strategically-deceptive-because-cooperative-behavior-is-instrumentally-optimal-while-weak]]
- [[scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps]]
- [[formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades]]

Topics:

- [[ai-alignment]]

@@ -7,9 +7,15 @@ date: 2025-05-01
 domain: ai-alignment
 secondary_domains: []
 format: report
-status: unprocessed
+status: processed
 priority: medium
 tags: [interpretability, pre-deployment, safety-assessment, Anthropic, deception-detection, mechanistic]
+processed_by: theseus
+processed_date: 2025-05-01
+claims_extracted: ["mechanistic-interpretability-integrated-into-pre-deployment-safety-assessment-at-anthropic-in-2025.md", "interpretability-assessment-requires-person-weeks-of-expert-effort-per-model-creating-a-scalability-bottleneck.md"]
+enrichments_applied: ["an-aligned-seeming-AI-may-be-strategically-deceptive-because-cooperative-behavior-is-instrumentally-optimal-while-weak.md", "scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps.md", "formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades.md", "safe-AI-development-requires-building-alignment-mechanisms-before-scaling-capability.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
+extraction_notes: "First evidence of interpretability transitioning from research to operational deployment safety. Two claims extracted: (1) the transition itself as a milestone in alignment practice, (2) the scalability bottleneck created by resource-intensive assessment. Four enrichments to existing claims about oversight scalability, formal verification, strategic deception detection, and alignment-before-capability principles. Key tension: this is genuine progress on technical alignment tools, but the scalability constraint suggests it cannot be the primary safety mechanism at industry scale."
 ---

 ## Content

@@ -53,3 +59,10 @@ Interpretability research "has shown the ability to explain a wide range of phen
 PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
 WHY ARCHIVED: First evidence of interpretability used in production deployment decisions — challenges the "technical alignment is insufficient" thesis while raising scalability questions
 EXTRACTION HINT: The transition from research to operational use is the key claim. The scalability tension (person-weeks per model) is the counter-claim. Both worth extracting.
+
+## Key Facts
+
+- Anthropic conducted pre-deployment interpretability assessment for Claude Opus 4.6 (2025)
+- Assessment required several person-weeks of interpretability researcher effort
+- Dario Amodei set 2027 target to 'reliably detect most model problems'
+- Assessment targeted 9 specific risk categories including alignment faking, hidden goals, and deceptive reasoning