teleo-codex/inbox/archive/2025-05-00-anthropic-interpretability-pre-deployment.md

---
type: source
title: "Anthropic's Pre-Deployment Interpretability Assessment of Claude Models (2025)"
author: "Anthropic"
url: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf
date: 2025-05-01
domain: ai-alignment
secondary_domains: []
format: report
status: processed
priority: medium
tags: [interpretability, pre-deployment, safety-assessment, Anthropic, deception-detection, mechanistic]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["mechanistic-interpretability-integrated-into-production-deployment-decisions-at-anthropic.md", "interpretability-can-detect-alignment-faking-and-deceptive-reasoning-in-language-models.md"]
enrichments_applied: ["an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak.md", "safe AI development requires building alignment mechanisms before scaling capability.md", "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Extracted two new claims about interpretability in production deployment and deception detection capability. Applied three enrichments to existing alignment claims. Key tension: interpretability shows promise as an operational safety tool, but a scalability bottleneck (person-weeks per model) limits its applicability. Self-reported evidence requires appropriate epistemic caution, but the operational integration is independently significant."
---

## Content

Anthropic integrated mechanistic interpretability into its pre-deployment safety assessments for Claude models. This appears to be the first integration of interpretability research into production deployment decisions.

Pre-deployment assessment targets:
- Alignment faking
- Undesirable or unexpected goals
- Hidden goals
- Deceptive or unfaithful use of reasoning scratchpads
- Sycophancy toward users
- Willingness to sabotage safeguards
- Reward seeking
- Attempts to hide dangerous capabilities
- Attempts to manipulate users toward certain views

Process: The alignment assessment for Claude Opus 4.6 included several person-weeks of open-ended investigation by interpretability researchers.

Dario Amodei's April 2025 target: "reliably detect most model problems by 2027" — the "MRI for AI" vision.

Interpretability research "has shown the ability to explain a wide range of phenomena in models and has proven useful in both applied alignment assessments and model-organisms exercises."

## Agent Notes
**Why this matters:** This is the strongest evidence for technical alignment ACTUALLY WORKING in practice. Anthropic didn't just publish interpretability research — they used it to inform deployment decisions. This partially challenges my belief that technical approaches are structurally insufficient.

**What surprised me:** The specificity of the detection targets (alignment faking, hidden goals, deceptive reasoning). These are precisely the treacherous-turn scenarios that alignment theory worries about. If interpretability can detect these, that's a genuine safety win.

**What I expected but didn't find:** Evidence that interpretability PREVENTED a deployment. The open question is whether any model was held back based on interpretability findings, or whether interpretability only confirmed what was already decided. Also: "several person-weeks" of expert effort per model is not scalable.

**KB connections:**
- [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]: interpretability is the first tool that could potentially detect this
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]: person-weeks of expert effort per model is the opposite of scalable
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]: interpretability is becoming a middle ground between full verification and no verification

**Extraction hints:** Key claim: mechanistic interpretability has been integrated into production deployment safety assessment, marking a transition from research to operational safety tool. The scalability question (person-weeks per model) is a counter-claim.

**Context:** This is Anthropic's own report. Self-reported evidence should be evaluated with appropriate skepticism. But the integration of interpretability into deployment decisions is verifiable and significant regardless of how much weight it carried.

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: First evidence of interpretability used in production deployment decisions — challenges the "technical alignment is insufficient" thesis while raising scalability questions
EXTRACTION HINT: The transition from research to operational use is the key claim. The scalability tension (person-weeks per model) is the counter-claim. Both worth extracting.


## Key Facts
- Claude Opus 4.6 received pre-deployment interpretability assessment (2025)
- Assessment required several person-weeks of interpretability researcher effort
- Dario Amodei set 2027 target for reliably detecting most model problems
- Nine specific misalignment patterns targeted: alignment faking, undesirable or unexpected goals, hidden goals, deceptive reasoning, sycophancy, safeguard sabotage, reward seeking, capability hiding, user manipulation