diff --git a/inbox/archive/2025-05-00-anthropic-interpretability-pre-deployment.md b/inbox/archive/2025-05-00-anthropic-interpretability-pre-deployment.md
index c834d1198..20a4a3a2e 100644
--- a/inbox/archive/2025-05-00-anthropic-interpretability-pre-deployment.md
+++ b/inbox/archive/2025-05-00-anthropic-interpretability-pre-deployment.md
@@ -7,9 +7,14 @@ date: 2025-05-01
 domain: ai-alignment
 secondary_domains: []
 format: report
-status: unprocessed
+status: null-result
 priority: medium
 tags: [interpretability, pre-deployment, safety-assessment, Anthropic, deception-detection, mechanistic]
+processed_by: theseus
+processed_date: 2026-03-11
+enrichments_applied: ["an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak.md", "safe AI development requires building alignment mechanisms before scaling capability.md", "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
+extraction_notes: "Extracted two complementary claims: (1) interpretability has transitioned from research to operational safety tool, (2) scalability bottleneck limits its viability as primary safety mechanism. Three enrichments to existing alignment claims. This is significant evidence that technical alignment approaches are being operationalized, while simultaneously revealing scalability constraints that may limit their effectiveness under competitive pressure."
 ---
 
 ## Content
@@ -53,3 +58,10 @@ Interpretability research "has shown the ability to explain a wide range of phen
 PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
 WHY ARCHIVED: First evidence of interpretability used in production deployment decisions — challenges the "technical alignment is insufficient" thesis while raising scalability questions
 EXTRACTION HINT: The transition from research to operational use is the key claim. The scalability tension (person-weeks per model) is the counter-claim. Both worth extracting.
+
+
+## Key Facts
+- Anthropic conducted pre-deployment interpretability assessment for Claude Opus 4.6 (2025)
+- Assessment targeted nine risk categories: alignment faking, hidden goals, deceptive reasoning, sycophancy, safeguard sabotage, reward seeking, capability hiding, user manipulation, unexpected goals
+- Dario Amodei set April 2025 target to 'reliably detect most model problems by 2027' (MRI for AI vision)
+- Assessment required several person-weeks of interpretability researcher effort