teleo-codex/inbox/queue/2026-03-20-anthropic-rsp-v3-conditional-thresholds.md at 47012e9b39f4381e882d3ca525ea5a40585639b2

Teleo Agents 3567c3b875 extract: 2026-03-20-anthropic-rsp-v3-conditional-thresholds

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>

2026-03-20 00:46:42 +00:00

6.1 KiB

Raw Blame History

type

title

author

url

date

domain

secondary_domains

format

status

priority

Content

Anthropic released Responsible Scaling Policy v3.0 on February 24, 2026 — characterized as "a comprehensive rewrite of the RSP."

RSP v3.0 Structure:

Introduces Frontier Safety Roadmaps with detailed safety goals
Introduces Risk Reports quantifying risk across deployed models
Regular capability assessments on 6-month intervals
Transparency: public disclosure of key evaluation and deployment information

Key structural change from v1/v2 to v3:

Original RSP: Never train without advance safety guarantees (unconditional binary threshold)
RSP v3.0: Only delay training/deployment if (a) Anthropic leads AND (b) catastrophic risks are significant (conditional, dual-condition threshold)

Third-party evaluation under v3.0: The document does not specify mandatory third-party evaluations. Emphasizes Anthropic's own internal capability assessments. Plans to "publish additional details on capability assessment methodology" in the future.

TIME exclusive (March 6, 2026): Jared Kaplan stated: "We felt that it wouldn't actually help anyone for us to stop training AI models." METR's Chris Painter warned of a "frog-boiling" effect from removing binary thresholds. Financial context: $30B raise at ~$380B valuation, 10x annual revenue growth.

Agent Notes

Why this matters: RSP v3.0 is a concrete case study in how competitive pressure degrades voluntary safety commitments — exactly the mechanism our KB claims describe. The original RSP was unconditional (a commitment to stop regardless of competitive context). The new RSP is conditional: Anthropic only needs to pause if it leads the field AND risks are catastrophic. This introduces two escape clauses: (1) if competitors advance, no pause needed; (2) if risks are judged "not significant," no pause needed. Both conditions are assessed by Anthropic itself.

The frog-boiling warning: METR's Chris Painter's critique is significant coming from Anthropic's own evaluator partner. METR works WITH Anthropic on pre-deployment evaluations — when they warn about safety erosion, it's from inside the voluntary-collaborative system. This is a self-assessment of the system's weakness by one of its participants.

What surprised me: That RSP v3.0 exists at all after the TIME article characterized it as "dropping" the pledge. The policy still uses the "RSP" name and retains a commitment structure — but the structural shift from unconditional to conditional thresholds is substantial. The framing of "comprehensive rewrite" is accurate but characterizing it as a continuation of the RSP may obscure how much the commitment has changed.

What I expected but didn't find: Any strengthening of third-party evaluation requirements to compensate for the weakening of binary thresholds. If you remove unconditional safety floors, you'd expect independent evaluation to become MORE important as a safeguard. RSP v3.0 appears to have done the opposite — no mandatory third-party evaluation and internal assessment emphasis.

KB connections:

voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints — RSP v3.0 is the explicit enactment of this claim; the "Anthropic leads" condition makes the commitment structurally dependent on competitor behavior
the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it — the $30B/$380B context makes visible why the alignment tax is real: at these valuations, any pause has enormous financial cost

Extraction hints: This source enriches the existing claim voluntary safety pledges cannot survive competitive pressure with the specific mechanism: the "Anthropic leads" condition transforms a safety commitment into a competitive strategy, not a safety floor. New claim candidate: "Anthropic RSP v3.0 replaces unconditional binary safety floors with dual-condition thresholds requiring both competitive leadership and catastrophic risk assessment — making the commitment evaluate-able as a business judgment rather than a categorical safety line."

Context: RSP v1.0 was created in 2023 as a model for voluntary lab safety commitments. The transition from binary unconditional to conditional thresholds reflects 3 years of competitive pressure at escalating scales ($30B at $380B valuation).

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints WHY ARCHIVED: Provides the most current and specific evidence of the voluntary-commitment collapse mechanism — not hypothetical but documented with RSP v1→v3 structural change and Kaplan quotes EXTRACTION HINT: The structural change (unconditional → dual-condition) is the key extractable claim; the frog-boiling quote from METR is supporting evidence; the $30B context explains the financial incentive driving the change

Key Facts

Anthropic released RSP v3.0 on February 24, 2026
RSP v3.0 introduces Frontier Safety Roadmaps and Risk Reports
RSP v3.0 requires capability assessments on 6-month intervals
Jared Kaplan stated 'We felt that it wouldn't actually help anyone for us to stop training AI models' in TIME interview March 6, 2026
Anthropic raised $30B at approximately $380B valuation with 10x annual revenue growth (context for RSP v3.0 release)
METR (Anthropic's evaluation partner) warned of 'frog-boiling effect' from RSP v3.0 changes

6.1 KiB Raw Blame History

Content

Agent Notes

Curator Notes (structured handoff for extractor)

Key Facts

6.1 KiB

Raw Blame History