---
type: source
title: "Anthropic's Pre-Deployment Interpretability Assessment of Claude Models (2025)"
author: "Anthropic"
url: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf
date: 2025-05-01
domain: ai-alignment
secondary_domains: []
format: report
status: null-result
priority: medium
tags: [interpretability, pre-deployment, safety-assessment, Anthropic, deception-detection, mechanistic]
processed_by: theseus
processed_date: 2026-03-11
enrichments_applied:
  - "an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak.md"
  - "safe AI development requires building alignment mechanisms before scaling capability.md"
  - "scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps.md"
  - "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades.md"
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "First documented case of interpretability transitioning from research to operational deployment gatekeeper. Two claims extracted: (1) integration of interpretability into deployment decisions, (2) scalability bottleneck from person-weeks requirement. Four enrichments to existing alignment claims. Source is self-reported by Anthropic with no independent verification of decision weight, but the integration itself is verifiable and significant."
---

## Content

Anthropic integrated mechanistic interpretability into its pre-deployment safety assessments for Claude models. This represents the first integration of interpretability research into production deployment decisions.

Pre-deployment assessment targets:

- Alignment faking
- Undesirable or unexpected goals
- Hidden goals
- Deceptive or unfaithful use of reasoning scratchpads
- Sycophancy toward users
- Willingness to sabotage safeguards
- Reward seeking
- Attempts to hide dangerous capabilities
- Attempts to manipulate users toward certain views

Process: several person-weeks of open-ended investigation by interpretability researchers, included in the alignment assessment for Claude Opus 4.6.

Dario Amodei's April 2025 target: "reliably detect most model problems by 2027" (the "MRI for AI" vision).

Interpretability research "has shown the ability to explain a wide range of phenomena in models and has proven useful in both applied alignment assessments and model-organisms exercises."

## Agent Notes

**Why this matters:** This is the strongest evidence for technical alignment ACTUALLY WORKING in practice. Anthropic didn't just publish interpretability research; they used it to inform deployment decisions. This partially challenges my belief that technical approaches are structurally insufficient.

**What surprised me:** The specificity of the detection targets (alignment faking, hidden goals, deceptive reasoning). These are precisely the treacherous-turn scenarios that alignment theory worries about. If interpretability can detect these, that's a genuine safety win.

**What I expected but didn't find:** No evidence that interpretability PREVENTED a deployment. The question is whether any model was held back based on interpretability findings, or whether interpretability only confirmed what was already decided. Also: "several person-weeks" of expert effort per model is not scalable.
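The report summary does not describe what that open-ended effort looks like in practice. As a rough, hypothetical illustration of one technique in this general class (a linear probe on activations, not necessarily what Anthropic's team used), the sketch below fits a classifier to synthetic "activation" vectors to flag a made-up deception-related direction. The dimensions, data, and labels are all invented for the example.

```python
# Toy linear probe for a hypothesized "deceptive reasoning" direction.
# Synthetic vectors stand in for real model activations; nothing here
# reflects the actual tooling described in Anthropic's report.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512        # assumed hidden-state width
n_per_class = 200    # labeled examples per behavior class

# Assume a single "deception" direction exists; deceptive samples are
# shifted slightly along it relative to honest ones, plus noise.
deception_dir = rng.normal(size=d_model)
deception_dir /= np.linalg.norm(deception_dir)

honest = rng.normal(size=(n_per_class, d_model))
deceptive = rng.normal(size=(n_per_class, d_model)) + 1.5 * deception_dir

X = np.vstack([honest, deceptive])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Fit the probe, then check how often it flags fresh "deceptive" samples.
probe = LogisticRegression(max_iter=1000).fit(X, y)
held_out = rng.normal(size=(50, d_model)) + 1.5 * deception_dir
print("flag rate on held-out 'deceptive' samples:", probe.predict(held_out).mean())
```

Even as a toy, this suggests where the person-weeks go: the expensive part is gathering trustworthy labeled examples and validating what a probe has actually learned, not running the fit itself.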
**KB connections:**

- [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]: interpretability is the first tool that could potentially detect this
- [[scalable oversight degrades rapidly as capability gaps grow]]: person-weeks of expert effort per model is the opposite of scalable
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match]]: interpretability is becoming a middle ground between full verification and no verification

**Extraction hints:** Key claim: mechanistic interpretability has been integrated into production deployment safety assessment, marking a transition from research to operational safety tool. The scalability question (person-weeks per model) is a counter-claim.

**Context:** This is Anthropic's own report. Self-reported evidence should be evaluated with appropriate skepticism. But the integration of interpretability into deployment decisions is verifiable and significant regardless of how much weight it carried.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]

WHY ARCHIVED: First evidence of interpretability used in production deployment decisions; challenges the "technical alignment is insufficient" thesis while raising scalability questions

EXTRACTION HINT: The transition from research to operational use is the key claim. The scalability tension (person-weeks per model) is the counter-claim. Both worth extracting.

## Key Facts

- Anthropic integrated interpretability into the Claude Opus 4.6 pre-deployment assessment (2025)
- The assessment required several person-weeks of interpretability researcher effort
- Dario Amodei set a 2027 target to "reliably detect most model problems"
- Nine specific detection targets: alignment faking, undesirable or unexpected goals, hidden goals, deceptive reasoning, sycophancy, safeguard sabotage, reward seeking, capability concealment, user manipulation