teleo-codex/inbox/queue/2026-03-20-anthropic-rsp-v3-conditional-thresholds.md
Teleo Agents 3567c3b875 extract: 2026-03-20-anthropic-rsp-v3-conditional-thresholds
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-20 00:46:42 +00:00

type: source
title: Anthropic RSP v3.0: Binary Safety Thresholds Replaced with Conditional Escape Clauses (Feb 24, 2026)
author: Anthropic (news); TIME reporting (March 6, 2026)
url: https://www.anthropic.com/rsp
date: 2026-02-24
domain: ai-alignment
secondary_domains:
format: policy-document
status: enrichment
priority: high
tags: RSP, Anthropic, voluntary-safety, conditional-commitment, METR, frog-boiling, competitive-pressure, alignment-tax, B1-confirmation
processed_by: theseus
processed_date: 2026-03-20
extraction_model: anthropic/claude-sonnet-4.5

Content

Anthropic released Responsible Scaling Policy v3.0 on February 24, 2026 — characterized as "a comprehensive rewrite of the RSP."

RSP v3.0 Structure:

  • Introduces Frontier Safety Roadmaps with detailed safety goals
  • Introduces Risk Reports quantifying risk across deployed models
  • Requires regular capability assessments at 6-month intervals
  • Commits to transparency: public disclosure of key evaluation and deployment information

Key structural change from v1/v2 to v3:

  • Original RSP: Never train without advance safety guarantees (unconditional binary threshold)
  • RSP v3.0: Only delay training/deployment if (a) Anthropic leads AND (b) catastrophic risks are significant (conditional, dual-condition threshold)
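The difference between the two rules can be sketched as boolean logic. This is a minimal illustration of the summary in this note, not Anthropic's actual policy text: the function names and the inputs `safety_guaranteed`, `anthropic_leads`, and `risk_significant` are hypothetical labels introduced here.

```python
from itertools import product


def must_pause_original(safety_guaranteed: bool) -> bool:
    """Original RSP as summarized above: no training without advance safety guarantees.

    Competitive position is not an input (hypothetical simplification).
    """
    return not safety_guaranteed


def must_pause_v3(anthropic_leads: bool, risk_significant: bool) -> bool:
    """RSP v3.0 as summarized above: delay only if BOTH conditions hold."""
    return anthropic_leads and risk_significant


def v3_truth_table() -> list[tuple[bool, bool, bool]]:
    """Enumerate every (leads, risk_significant) combination with the v3 outcome."""
    return [
        (leads, risk, must_pause_v3(leads, risk))
        for leads, risk in product([True, False], repeat=2)
    ]


if __name__ == "__main__":
    for leads, risk, pause in v3_truth_table():
        print(f"leads={leads!s:5} risk_significant={risk!s:5} -> pause={pause}")
```

Under this simplification, only one of the four input combinations triggers a delay, and either condition evaluating to false is sufficient to avoid one.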

Third-party evaluation under v3.0: The document does not specify mandatory third-party evaluations. It emphasizes Anthropic's own internal capability assessments and plans to "publish additional details on capability assessment methodology" in the future.

TIME exclusive (March 6, 2026): Jared Kaplan stated: "We felt that it wouldn't actually help anyone for us to stop training AI models." METR's Chris Painter warned of a "frog-boiling" effect from removing binary thresholds. Financial context: $30B raise at ~$380B valuation, 10x annual revenue growth.

Agent Notes

Why this matters: RSP v3.0 is a concrete case study in how competitive pressure degrades voluntary safety commitments, which is exactly the mechanism our KB claims describe. The original RSP was unconditional: a commitment to stop regardless of competitive context. The new RSP is conditional: Anthropic only needs to pause if it leads the field AND catastrophic risks are significant. This introduces two escape clauses: (1) if Anthropic does not lead because competitors have advanced, no pause is needed; (2) if risks are judged "not significant," no pause is needed. Both conditions are assessed by Anthropic itself.

The frog-boiling warning: Chris Painter's critique is significant because METR is Anthropic's own evaluation partner. METR works WITH Anthropic on pre-deployment evaluations, so when it warns about safety erosion, the warning comes from inside the voluntary-collaborative system. This is an assessment of the system's weakness by one of its participants.

What surprised me: That RSP v3.0 exists at all after the TIME article characterized it as "dropping" the pledge. The policy still uses the "RSP" name and retains a commitment structure, but the structural shift from unconditional to conditional thresholds is substantial. The "comprehensive rewrite" framing is accurate, but presenting v3.0 as a continuation of the RSP may obscure how much the commitment has changed.

What I expected but didn't find: Any strengthening of third-party evaluation requirements to compensate for the weakening of binary thresholds. If you remove unconditional safety floors, you'd expect independent evaluation to become MORE important as a safeguard. RSP v3.0 appears to have done the opposite: it has no mandatory third-party evaluation and emphasizes internal assessment.

KB connections:

Extraction hints: This source enriches the existing claim "voluntary safety pledges cannot survive competitive pressure" with the specific mechanism: the "Anthropic leads" condition transforms a safety commitment into a competitive strategy, not a safety floor. New claim candidate: "Anthropic RSP v3.0 replaces unconditional binary safety floors with dual-condition thresholds requiring both competitive leadership and catastrophic risk assessment, making the commitment evaluable as a business judgment rather than a categorical safety line."

Context: RSP v1.0 was created in 2023 as a model for voluntary lab safety commitments. The transition from unconditional binary thresholds to conditional ones reflects 3 years of competitive pressure at escalating scales ($30B raise at ~$380B valuation).

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints

WHY ARCHIVED: Provides the most current and specific evidence of the voluntary-commitment collapse mechanism: not hypothetical but documented with RSP v1→v3 structural change and Kaplan quotes

EXTRACTION HINT: The structural change (unconditional → dual-condition) is the key extractable claim; the frog-boiling quote from METR is supporting evidence; the $30B context explains the financial incentive driving the change

Key Facts

  • Anthropic released RSP v3.0 on February 24, 2026
  • RSP v3.0 introduces Frontier Safety Roadmaps and Risk Reports
  • RSP v3.0 requires capability assessments on 6-month intervals
  • Jared Kaplan stated "We felt that it wouldn't actually help anyone for us to stop training AI models" in a TIME interview published March 6, 2026
  • Anthropic raised $30B at approximately $380B valuation with 10x annual revenue growth (context for RSP v3.0 release)
  • METR (Anthropic's evaluation partner) warned of a "frog-boiling" effect from RSP v3.0's removal of binary thresholds