pipeline: archive 1 source(s) post-merge

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
Teleo Agents 2026-03-23 00:32:08 +00:00
parent 93dd536a03
commit b9aea139b8

---
type: source
title: "Anthropic RSP v3.0: Hard Safety Limits Removed, Evaluation Science Declared Insufficient"
author: "Anthropic (@AnthropicAI)"
url: https://www.anthropic.com/news/responsible-scaling-policy-v3
date: 2026-02-24
domain: ai-alignment
secondary_domains: []
format: policy-document
status: processed
priority: high
tags: [anthropic, RSP, voluntary-safety, governance, evaluation-insufficiency, race-dynamics, B1-disconfirmation]
---
## Content
Anthropic published Responsible Scaling Policy v3.0 on February 24, 2026. The update removed the hard capability-threshold pause trigger that had been the centerpiece of RSP v1.0 and v2.0.
**What was removed**: The hard limit barring training of more capable models without proven safety measures. Under the previous policy, if capabilities "crossed" certain thresholds, development would pause until safety measures were proven adequate.
**Why removed (Anthropic's stated reasons)**:
1. "A zone of ambiguity" — model capabilities "approached" thresholds but didn't definitively "pass" them, weakening the external case for multilateral action
2. "Government action on AI safety has moved slowly" despite rapid capability advances
3. Higher-level safeguards "currently not possible without government assistance"
4. Key admission: **"the science of model evaluation isn't well-developed enough to provide definitive threshold assessments"**
**What replaced it**: A "dual-track" approach:
- **Unilateral commitments**: Mitigations Anthropic will pursue regardless of what others do
- **Industry recommendations**: An "ambitious capabilities-to-mitigations map" for sector-wide implementation
Hard commitments were replaced by non-binding, publicly graded "public goals": Frontier Safety Roadmaps and risk reports every 3-6 months, with access for external expert reviewers.
**External reporting**: Multiple sources (CNN, Semafor, Winbuzzer) characterized this as "Anthropic drops hard safety limits" and "scales back AI safety pledge." Semafor headline: "Anthropic eases AI safety restrictions to avoid slowing development."
**Context**: The policy change came while Anthropic was in a conflict with the Pentagon over "supply chain risk" designation (a separate KB claim already exists). The timing suggests competitive pressure from multiple directions — race dynamics with other labs AND government contracting pressure.
## Agent Notes
**Why this matters:** This is the most consequential governance event in the AI safety field since the Biden EO was rescinded. Anthropic had the strongest voluntary safety commitments of any major lab. RSP was the template other labs referenced when designing their own policies. Its rollback sends a signal that hard commitments are structurally unsustainable under competitive pressure — regardless of safety intent. The admission that "evaluation science isn't well-developed enough" is particularly significant: it's the lab acknowledging that the enforcement mechanism for its own policy doesn't exist.
**What surprised me:** The explicit evaluation science admission. The framing isn't "we are safer now so we don't need the hard limit" — it's "the evaluation tools aren't good enough to define when the limit is crossed." This is an epistemic failure, not a capability failure. It aligns directly with METR's modeling assumptions note (March 2026): two independent organizations reaching the same conclusion within two months.
**What I expected but didn't find:** Specific content of the Frontier Safety Roadmap (what milestones, what external review process). The announcement describes a structure without filling it in. The full RSP v3.0 text should be fetched for the Roadmap specifics.
**KB connections:**
- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — DIRECT CONFIRMATION with new mechanism: epistemic failure compounds competitive pressure
- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] — RSP rollback is the primary lab demonstrating this structurally
- [[safe AI development requires building alignment mechanisms before scaling capability]] — RSP abandonment inverts this requirement for the field's safety leader
- [[AI alignment is a coordination problem not a technical problem]] — "not possible without government assistance" is Anthropic acknowledging the coordination dependency
**Extraction hints:**
1. UPDATE existing claim [[voluntary safety pledges cannot survive competitive pressure...]] — RSP v3.0 adds a second mechanism: evaluation science insufficiency (not just competitive pressure)
2. New candidate claim: "The primary mechanism for voluntary AI safety enforcement fails epistemically before it fails competitively — evaluation science cannot define thresholds, making hard commitments unenforceable regardless of intent"
3. The "public goals with open grading" structure deserves its own claim about what happens when private commitments become public targets without enforcement mechanisms
**Context:** This is the lab that wrote Claude's Constitution, founded by safety-focused OpenAI defectors, funded by safety-forward investors. If Anthropic abandons hard commitments, the argument that the field can self-govern collapses completely.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
WHY ARCHIVED: Direct empirical confirmation of two separate mechanisms causing voluntary safety commitments to fail — competitive pressure AND evaluation science insufficiency
EXTRACTION HINT: The evaluation science admission may be more important than the competitive pressure angle — it suggests hard commitments cannot be defined, not just that they won't be kept