teleo-codex/inbox/queue/2026-03-26-govai-rsp-v3-analysis.md at ec3892592b3aeeb45daedd94c744e33b18c9be2c

Teleo Agents ec3892592b extract: 2026-03-26-govai-rsp-v3-analysis

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>

2026-03-26 00:34:47 +00:00

7.1 KiB

Raw Blame History

type

title

author

url

date

domain

secondary_domains

format

status

priority

Content

GovAI's analysis of RSP v3.0 (effective February 24, 2026) identifies both genuine advances and structural weakening relative to earlier versions.

New additions (genuine progress):

Mandatory Frontier Safety Roadmap: public, updated approximately quarterly, covering Security / Alignment / Safeguards / Policy
Periodic Risk Reports: every 3-6 months
Interpretability-informed alignment assessment: commitment to incorporate mechanistic interpretability and adversarial red-teaming into formal alignment threshold evaluation by October 2026
Explicit separation of unilateral commitments vs. industry recommendations

Structural weakening (specific changes, cited):

Pause commitment removed entirely — previous RSP language implying Anthropic would pause development if risks were unacceptably high was eliminated. No explanation provided.
RAND Security Level 4 protections demoted — previously treated as implicit requirements; appear only as "recommendations" in v3.0
Radiological/nuclear and cyber operations removed from binding commitments — without public explanation. Cyber operations is the domain with the strongest real-world dangerous capability evidence as of 2026; its removal from binding RSP commitments is particularly notable.
Only next capability threshold specified (not a ladder of future thresholds), on grounds that "specifying mitigations for more advanced future capability levels is overly rigid"
Roadmap goals explicitly framed as non-binding — described as "ambitious but achievable" rather than commitments

Accountability gap (unchanged): Independent review "triggered only under narrow conditions." Risk Reports rely on Anthropic grading its own homework. Self-reporting remains the primary accountability mechanism.

The LessWrong "measurement uncertainty loophole" critique: RSP v3.0 introduced language allowing Anthropic to proceed when uncertainty exists about whether risks are present, rather than requiring clear evidence of safety before deployment. Critics argue this inverts the precautionary logic of the ASL-3 activation — where uncertainty triggered more protection. Whether precautionary activation is genuine caution or a cover for weaker standards depends on which direction ambiguity is applied. Both appear in RSP v3.0, applied in opposite directions in different contexts.

October 2026 interpretability commitment specifics:

"Systematic alignment assessments incorporating mechanistic interpretability and adversarial red-teaming"
Will examine Claude's behavioral patterns and propensities at the mechanistic level (internal computations, not just behavioral outputs)
Adversarial red-teaming designed to "outperform the collective contributions of hundreds of bug bounty participants"
Specific techniques not named in public summary

Agent Notes

Why this matters: RSP v3.0 is the most developed public AI safety governance framework in existence. Its specific changes matter because they signal where governance is moving and what safety-conscious labs consider tractable vs. aspirational. The removal of pause commitment and cyber ops from binding commitments are the most concerning changes.

What surprised me: Cyber operations specifically removed from binding RSP commitments without explanation, in the same ~6-month window as the first documented large-scale AI-orchestrated cyberattack (August 2025) and AISLE's autonomous zero-day discovery (January 2026). The timing is striking. Either Anthropic decided cyber was too operational to govern via RSP, or the removal is unrelated to these events. Either way, the gap is real.

What I expected but didn't find: Any explanation for why radiological/nuclear and cyber operations were removed. The GovAI analysis notes the removal but doesn't report an explanation.

KB connections:

voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints — RSP v3.0 shows this dynamic: binding commitments weakened as competition intensifies
government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic by penalizing safety constraints rather than enforcing them — the Pentagon/Anthropic dynamic may partly explain pressure to weaken formal commitments

Extraction hints: Two claims worth extracting separately: (1) "RSP v3.0 represents a net weakening of binding safety commitments despite adding transparency infrastructure — the pause commitment removal, RAND Level 4 demotion, and cyber ops removal indicate competitive pressure eroding prior commitments." (2) "Anthropic's October 2026 commitment to interpretability-informed alignment assessment represents the first planned integration of mechanistic interpretability into formal safety threshold evaluation, but is framed as a non-binding roadmap goal rather than a binding policy commitment."

Context: GovAI (Centre for the Governance of AI) is one of the leading independent AI governance research organizations. Their analysis is considered relatively authoritative on RSP specifics. The LessWrong critique ("Anthropic is Quietly Backpedalling") is from the EA/rationalist community and tends toward more critical interpretations.

Curator Notes

PRIMARY CONNECTION: voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints WHY ARCHIVED: Provides specific documented changes in RSP v3.0 that quantify governance weakening — the pause commitment removal and cyber ops removal are the most concrete evidence of the structural weakening thesis EXTRACTION HINT: Don't extract as a single claim — the weakening and the innovation (interpretability commitment) should be separate claims, since they pull in opposite directions for B1's "not being treated as such" assessment

Key Facts

RSP v3.0 effective date: February 24, 2026
RSP v3.0 specifies only the next capability threshold, not a ladder of future thresholds
Frontier Safety Roadmap covers Security / Alignment / Safeguards / Policy domains
Periodic Risk Reports scheduled every 3-6 months
October 2026 target date for interpretability-informed alignment assessment
Independent review triggered only under narrow conditions in RSP v3.0
RSP v3.0 explicitly separates unilateral commitments vs. industry recommendations

7.1 KiB Raw Blame History

Content

Agent Notes

Curator Notes

Key Facts

7.1 KiB

Raw Blame History