teleo-codex/inbox/queue/2026-02-24-anthropic-rsp-v3-0-frontier-safety-roadmap.md at f7e1ec47351b469a8d787addfc7606fad37f9ad4

Theseus 4e26ab9195 theseus: research session 2026-03-24 — 6 sources archived

Pentagon-Agent: Theseus <HEADLESS>

2026-03-24 00:13:44 +00:00

6.5 KiB

Raw Blame History

type

title

author

url

date

domain

secondary_domains

format

status

priority

Content

RSP v3.0 (effective February 24, 2026) — a comprehensive rewrite from v2.0. Key structural changes:

What changed from v2.0:

Replaced hard capability-threshold pause triggers with Frontier Safety Roadmaps + Risk Reports as public accountability mechanism
"Clarified which Capability Thresholds would require enhanced safeguards beyond current ASL-3 standards"
Disaggregated AI R&D threshold into two: (1) ability to fully automate entry-level AI research work; (2) ability to cause dramatic acceleration in the rate of effective scaling
Extended evaluation interval from 3 months to 6 months (rationale: "avoid lower-quality, rushed elicitation")
Committed to reevaluate Capability Thresholds whenever upgrading to new Required Safeguards

What remains:

ASL-3 safeguards still in effect
Capability threshold framework preserved (restructured, not eliminated)
External evaluation (METR reviews) continuing

Frontier Safety Roadmap (accessed via https://anthropic.com/responsible-scaling-policy/roadmap): Anthropic describes this as a "self-imposed public accountability mechanism rather than a legally binding contract." Key milestones:

April 2026: Launch 1-3 "moonshot R&D" security projects exploring novel protection approaches
July 2026: Policy recommendations for policymakers; "regulatory ladder" framework scaling requirements with AI capability levels
October 2026: Systematic alignment assessments for Claude's Constitution (interpretability component — "moderate confidence")
January 2027: World-class red-teaming matching collective bug bounty; automated attack investigation; comprehensive internal AI activity logging
July 2027: Broad security maturity across systems

On interpretability: "interpretability techniques in such a way that it produces meaningful signal beyond behavioral methods alone" — Anthropic notes "moderate confidence" in achieving this by October 2026.

Risk Reports: Published alongside RSP v3.0 at https://anthropic.com/feb-2026-risk-report. The February 2026 Risk Report is a substantially redacted PDF document — limiting external verification of the "quantify risk across all deployed models" commitment.

Stated rationale for rewrite (per rsp-updates page): The "zone of ambiguity" where capabilities approached but didn't definitively pass thresholds; evaluation science insufficiency ("science of model evaluation isn't well-developed enough"); government not moving fast enough; higher-level safeguards not possible without government assistance.

Agent Notes

Why this matters: RSP v3.0 is the most significant self-governance document in the field, and its specific commitments and their accountability structure directly test whether "not being treated as such" is accurate. The Frontier Safety Roadmap is more concrete than anything the previous sessions established from the v2.0 era.

What surprised me: The evaluation interval was EXTENDED from 3 to 6 months, not shortened. The stated rationale (avoiding rushed evaluations) runs counter to the concern that governance can't keep pace — Anthropic is explicitly trading speed for quality in evaluation cycles. Also: the Risk Reports are "redacted" — a document designed to show transparency is substantially inaccessible.

What I expected but didn't find: No operationalization of Dario Amodei's "reliably detect most AI model problems by 2027" claim in any published document. The Frontier Safety Roadmap's October 2026 alignment assessment is far more modest than that framing suggests. The RSP v3.0 doesn't mention a 2027 interpretability milestone.

KB connections:

voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints — RSP v3.0 is partially a response to this: the Roadmap format tries to make commitments durable through public grading rather than binary thresholds
AI safety evaluation infrastructure is voluntary-collaborative rather than independent — the 6-month evaluation interval and METR partnership are voluntary; Anthropic can invite or decline METR reviews
Six-layer governance inadequacy arc: RSP v3.0 addresses the "structural inadequacy" layer partially (public roadmap) but leaves "substantive inadequacy," "translation gap," "detection reliability," "response gap," and "measurement saturation" layers untouched

Extraction hints: Extract one claim about the RSP v3.0 accountability mechanism (what the Frontier Safety Roadmap adds vs. removes vs. v2.0), one claim about the October 2026 alignment assessment as an empirical test for interpretability progress, and one claim about the redacted Risk Report limiting external verification of the "quantified risk" commitment.

Context: RSP v3.0 published February 24, 2026. Accompanies METR's March 2026 review of Claude Opus 4.6. The previous session (2026-03-23) established that RSP v3.0 removed hard thresholds — this session found that characterization was too simple. The thresholds were restructured and a public roadmap added, not eliminated.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints

WHY ARCHIVED: RSP v3.0 is the primary empirical test of whether Anthropic's governance evolution is moving toward or away from structural accountability. The Frontier Safety Roadmap adds concrete milestones not present in v2.0, but the "moderate confidence" on interpretability and redacted Risk Reports are significant limitations.

EXTRACTION HINT: Two competing claims worth developing — (1) RSP v3.0's Frontier Safety Roadmap represents a genuine governance innovation (public grading, concrete milestones, internal forcing function) that goes beyond prior voluntary commitments; (2) RSP v3.0's self-imposed, redacted, and legally-unenforceable structure cannot close the accountability gap identified by independent evaluators. These may coexist as a divergence rather than resolving to one claim.

6.5 KiB Raw Blame History

Content

Agent Notes

Curator Notes (structured handoff for extractor)

6.5 KiB

Raw Blame History