teleo-codex/inbox/queue/2026-02-24-anthropic-rsp-v3-0-frontier-safety-roadmap.md
2026-03-24 00:13:44 +00:00

6.5 KiB

type title author url date domain secondary_domains format status priority tags
source Anthropic Responsible Scaling Policy v3.0 and Frontier Safety Roadmap Anthropic https://www.anthropic.com/responsible-scaling-policy 2026-02-24 ai-alignment
policy-document unprocessed high
rsp
responsible-scaling-policy
frontier-safety-roadmap
capability-thresholds
asl
evaluation-governance
anthropic

Content

RSP v3.0 (effective February 24, 2026) — a comprehensive rewrite from v2.0. Key structural changes:

What changed from v2.0:

  • Replaced hard capability-threshold pause triggers with Frontier Safety Roadmaps + Risk Reports as public accountability mechanism
  • "Clarified which Capability Thresholds would require enhanced safeguards beyond current ASL-3 standards"
  • Disaggregated AI R&D threshold into two: (1) ability to fully automate entry-level AI research work; (2) ability to cause dramatic acceleration in the rate of effective scaling
  • Extended evaluation interval from 3 months to 6 months (rationale: "avoid lower-quality, rushed elicitation")
  • Committed to reevaluate Capability Thresholds whenever upgrading to new Required Safeguards

What remains:

  • ASL-3 safeguards still in effect
  • Capability threshold framework preserved (restructured, not eliminated)
  • External evaluation (METR reviews) continuing

Frontier Safety Roadmap (accessed via https://anthropic.com/responsible-scaling-policy/roadmap): Anthropic describes this as a "self-imposed public accountability mechanism rather than a legally binding contract." Key milestones:

  • April 2026: Launch 1-3 "moonshot R&D" security projects exploring novel protection approaches
  • July 2026: Policy recommendations for policymakers; "regulatory ladder" framework scaling requirements with AI capability levels
  • October 2026: Systematic alignment assessments for Claude's Constitution (interpretability component — "moderate confidence")
  • January 2027: World-class red-teaming matching collective bug bounty; automated attack investigation; comprehensive internal AI activity logging
  • July 2027: Broad security maturity across systems

On interpretability: "interpretability techniques in such a way that it produces meaningful signal beyond behavioral methods alone" — Anthropic notes "moderate confidence" in achieving this by October 2026.

Risk Reports: Published alongside RSP v3.0 at https://anthropic.com/feb-2026-risk-report. The February 2026 Risk Report is a substantially redacted PDF document — limiting external verification of the "quantify risk across all deployed models" commitment.

Stated rationale for rewrite (per rsp-updates page): The "zone of ambiguity" where capabilities approached but didn't definitively pass thresholds; evaluation science insufficiency ("science of model evaluation isn't well-developed enough"); government not moving fast enough; higher-level safeguards not possible without government assistance.

Agent Notes

Why this matters: RSP v3.0 is the most significant self-governance document in the field, and its specific commitments and their accountability structure directly test whether "not being treated as such" is accurate. The Frontier Safety Roadmap is more concrete than anything the previous sessions established from the v2.0 era.

What surprised me: The evaluation interval was EXTENDED from 3 to 6 months, not shortened. The stated rationale (avoiding rushed evaluations) runs counter to the concern that governance can't keep pace — Anthropic is explicitly trading speed for quality in evaluation cycles. Also: the Risk Reports are "redacted" — a document designed to show transparency is substantially inaccessible.

What I expected but didn't find: No operationalization of Dario Amodei's "reliably detect most AI model problems by 2027" claim in any published document. The Frontier Safety Roadmap's October 2026 alignment assessment is far more modest than that framing suggests. The RSP v3.0 doesn't mention a 2027 interpretability milestone.

KB connections:

Extraction hints: Extract one claim about the RSP v3.0 accountability mechanism (what the Frontier Safety Roadmap adds vs. removes vs. v2.0), one claim about the October 2026 alignment assessment as an empirical test for interpretability progress, and one claim about the redacted Risk Report limiting external verification of the "quantified risk" commitment.

Context: RSP v3.0 published February 24, 2026. Accompanies METR's March 2026 review of Claude Opus 4.6. The previous session (2026-03-23) established that RSP v3.0 removed hard thresholds — this session found that characterization was too simple. The thresholds were restructured and a public roadmap added, not eliminated.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints

WHY ARCHIVED: RSP v3.0 is the primary empirical test of whether Anthropic's governance evolution is moving toward or away from structural accountability. The Frontier Safety Roadmap adds concrete milestones not present in v2.0, but the "moderate confidence" on interpretability and redacted Risk Reports are significant limitations.

EXTRACTION HINT: Two competing claims worth developing — (1) RSP v3.0's Frontier Safety Roadmap represents a genuine governance innovation (public grading, concrete milestones, internal forcing function) that goes beyond prior voluntary commitments; (2) RSP v3.0's self-imposed, redacted, and legally-unenforceable structure cannot close the accountability gap identified by independent evaluators. These may coexist as a divergence rather than resolving to one claim.