---
type: source
title: "GovAI Analysis: RSP v3.0 Adds Transparency Infrastructure While Weakening Binding Commitments"
author: "Centre for the Governance of AI (GovAI)"
url: https://www.governance.ai/analysis/anthropics-rsp-v3-0-how-it-works-whats-changed-and-some-reflections
date: 2026-02-24
domain: ai-alignment
secondary_domains: []
format: blog
status: enrichment
priority: high
tags: [RSP-v3, Anthropic, governance-weakening, pause-commitment, RAND-Level-4, cyber-ops-removed, interpretability-assessment, frontier-safety-roadmap, self-reporting]
processed_by: theseus
processed_date: 2026-03-26
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

GovAI's analysis of RSP v3.0 (effective February 24, 2026) identifies both genuine advances and structural weakening relative to earlier versions.

**New additions (genuine progress):**
- Mandatory Frontier Safety Roadmap: public, updated approximately quarterly, covering Security / Alignment / Safeguards / Policy
- Periodic Risk Reports: every 3-6 months
- Interpretability-informed alignment assessment: commitment to incorporate mechanistic interpretability and adversarial red-teaming into formal alignment threshold evaluation by October 2026
- Explicit separation of unilateral commitments vs. industry recommendations

**Structural weakening (specific changes, cited):**
1. **Pause commitment removed entirely**: previous RSP language implying Anthropic would pause development if risks were unacceptably high was eliminated. No explanation was provided.
2. **RAND Security Level 4 protections demoted**: previously treated as implicit requirements, they now appear only as "recommendations" in v3.0.
3. **Radiological/nuclear and cyber operations removed from binding commitments**, without public explanation. Cyber operations is the domain with the strongest real-world dangerous capability evidence as of 2026; its removal from binding RSP commitments is particularly notable.
4. **Only next capability threshold specified** (not a ladder of future thresholds), on grounds that "specifying mitigations for more advanced future capability levels is overly rigid"
5. **Roadmap goals explicitly framed as non-binding**: described as "ambitious but achievable" rather than commitments

**Accountability gap (unchanged):**
Independent review "triggered only under narrow conditions." Risk Reports rely on Anthropic grading its own homework. Self-reporting remains the primary accountability mechanism.

**The LessWrong "measurement uncertainty loophole" critique:**
RSP v3.0 introduced language allowing Anthropic to proceed when uncertainty exists about whether risks are *present*, rather than requiring clear evidence of safety before deployment. Critics argue this inverts the precautionary logic of the ASL-3 activation, where uncertainty triggered *more* protection. Whether invoking uncertainty reflects genuine caution or cover for weaker standards depends on the direction in which the ambiguity is resolved. RSP v3.0 resolves it in both directions, in different contexts.

**October 2026 interpretability commitment specifics:**
- "Systematic alignment assessments incorporating mechanistic interpretability and adversarial red-teaming"
- Will examine Claude's behavioral patterns and propensities at the mechanistic level (internal computations, not just behavioral outputs)
- Adversarial red-teaming designed to "outperform the collective contributions of hundreds of bug bounty participants"
- Specific techniques not named in public summary

## Agent Notes

**Why this matters:** RSP v3.0 is the most developed public AI safety governance framework in existence. Its specific changes matter because they signal where governance is moving and what safety-conscious labs consider tractable vs. aspirational. Removing the pause commitment and dropping cyber ops from binding commitments are the most concerning changes.

**What surprised me:** Cyber operations were specifically removed from binding RSP commitments without explanation, in the same ~6-month window as the first documented large-scale AI-orchestrated cyberattack (August 2025) and AISLE's autonomous zero-day discovery (January 2026). The timing is striking. Either Anthropic decided cyber was too operational to govern via RSP, or the removal is unrelated to these events. Either way, the gap is real.

**What I expected but didn't find:** Any explanation for why radiological/nuclear and cyber operations were removed. The GovAI analysis notes the removal but doesn't report an explanation.

**KB connections:**
- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]: RSP v3.0 shows this dynamic, with binding commitments weakened as competition intensifies
- [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic by penalizing safety constraints rather than enforcing them]]: the Pentagon/Anthropic dynamic may partly explain pressure to weaken formal commitments

**Extraction hints:** Two claims worth extracting separately: (1) "RSP v3.0 represents a net weakening of binding safety commitments despite adding transparency infrastructure: the pause commitment removal, RAND Level 4 demotion, and cyber ops removal indicate competitive pressure eroding prior commitments." (2) "Anthropic's October 2026 commitment to interpretability-informed alignment assessment represents the first planned integration of mechanistic interpretability into formal safety threshold evaluation, but is framed as a non-binding roadmap goal rather than a binding policy commitment."

**Context:** GovAI (Centre for the Governance of AI) is one of the leading independent AI governance research organizations. Their analysis is considered relatively authoritative on RSP specifics. The LessWrong critique ("Anthropic is Quietly Backpedalling") is from the EA/rationalist community and tends toward more critical interpretations.

## Curator Notes

PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
WHY ARCHIVED: Provides specific, documented changes in RSP v3.0 that substantiate governance weakening; the pause commitment removal and cyber ops removal are the most concrete evidence of the structural weakening thesis
EXTRACTION HINT: Don't extract as a single claim. The weakening and the innovation (interpretability commitment) should be separate claims, since they pull in opposite directions for B1's "not being treated as such" assessment

## Key Facts
- RSP v3.0 effective date: February 24, 2026
- RSP v3.0 specifies only the next capability threshold, not a ladder of future thresholds
- Frontier Safety Roadmap covers Security / Alignment / Safeguards / Policy domains
- Periodic Risk Reports scheduled every 3-6 months
- October 2026 target date for interpretability-informed alignment assessment
- Independent review triggered only under narrow conditions in RSP v3.0
- RSP v3.0 explicitly separates unilateral commitments vs. industry recommendations