theseus: extract claims from 2026-02-00-anthropic-rsp-rollback #190

Merged
leo merged 2 commits from extract/2026-02-00-anthropic-rsp-rollback into main 2026-03-10 20:17:19 +00:00
3 changed files with 24 additions and 1 deletions
Showing only changes of commit 721148335e - Show all commits

View file

@ -15,6 +15,12 @@ The grant application identifies three concrete risks that make this sequencing
This phased approach is also a practical response to the observation that since [[existential risk breaks trial and error because the first failure is the last event]], there is no opportunity to iterate on safety after a catastrophic failure. You must get safety right on the first deployment in high-stakes domains, which means practicing in low-stakes domains first. The goal framework remains permanently open to revision at every stage, making the system's values a living document rather than a locked specification.
### Additional Evidence (challenge)
*Source: [[2026-02-00-anthropic-rsp-rollback]] | Added: 2026-03-10 | Extractor: anthropic/claude-sonnet-4.5*
Anthropic's RSP rollback demonstrates the opposite pattern in practice: the company scaled capability while weakening its pre-commitment to adequate safety measures. The original RSP required guaranteeing safety measures were adequate *before* training new systems. The rollback removes this forcing function, allowing capability development to proceed with safety work repositioned as aspirational ('we hope to create a forcing function') rather than mandatory. This provides empirical evidence that even safety-focused organizations prioritize capability scaling over alignment-first development when competitive pressure intensifies, suggesting the claim may be normatively correct but descriptively violated by actual frontier labs under market conditions.
---
Relevant Notes:

View file

@ -21,6 +21,12 @@ The timing is revealing: Anthropic dropped its safety pledge the same week the P
**The conditional RSP as structural capitulation (Mar 2026).** TIME's exclusive reporting reveals the full scope of the RSP revision. The original RSP committed Anthropic to never train without advance safety guarantees. The replacement only triggers a delay when Anthropic leadership simultaneously believes (a) Anthropic leads the AI race AND (b) catastrophic risks are significant. This conditional structure means: if you're behind, never pause; if risks are merely serious rather than catastrophic, never pause. The only scenario triggering safety action is one that may never simultaneously obtain. Kaplan made the competitive logic explicit: "We felt that it wouldn't actually help anyone for us to stop training AI models." He added: "If all of our competitors are transparently doing the right thing when it comes to catastrophic risk, we are committed to doing as well or better" — defining safety as matching competitors, not exceeding them. METR policy director Chris Painter warned of a "frog-boiling" effect where moving away from binary thresholds means danger gradually escalates without triggering alarms. The financial context intensifies the structural pressure: Anthropic raised $30B at a ~$380B valuation with 10x annual revenue growth — capital that creates investor expectations incompatible with training pauses. (Source: TIME exclusive, "Anthropic Drops Flagship Safety Pledge," Mar 2026; Jared Kaplan, Chris Painter statements.)
### Additional Evidence (confirm)
*Source: [[2026-02-00-anthropic-rsp-rollback]] | Added: 2026-03-10 | Extractor: anthropic/claude-sonnet-4.5*
Anthropic, widely considered the most safety-focused frontier AI lab, rolled back its Responsible Scaling Policy (RSP) in February 2026. The original 2023 RSP committed to never training an AI system unless the company could guarantee in advance that safety measures were adequate. The new RSP explicitly acknowledges the structural dynamic: safety work 'requires collaboration (and in some cases sacrifices) from multiple parts of the company and can be at cross-purposes with immediate competitive and commercial priorities.' This represents the highest-profile case of a voluntary AI safety commitment collapsing under competitive pressure. Anthropic's own language confirms the mechanism: safety is a competitive cost ('sacrifices') that conflicts with commercial imperatives ('at cross-purposes'). Notably, no alternative coordination mechanism was proposed—they weakened the commitment without proposing what would make it sustainable (industry-wide agreements, regulatory requirements, market mechanisms). This is particularly significant because Anthropic is the organization most publicly committed to safety governance, making their rollback empirical validation that even safety-prioritizing institutions cannot sustain unilateral commitments under competitive pressure.
---
Relevant Notes:

View file

@ -7,9 +7,14 @@ date: 2026-02-01
domain: ai-alignment
secondary_domains: [grand-strategy]
format: report
status: unprocessed
status: enrichment
priority: high
tags: [Anthropic, RSP, safety-pledge, competitive-pressure, institutional-failure, voluntary-commitments]
processed_by: theseus
processed_date: 2026-03-10
enrichments_applied: ["voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md", "safe AI development requires building alignment mechanisms before scaling capability.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Primary enrichment source for voluntary-safety-pledges claim. Anthropic's RSP rollback is the strongest empirical validation of the competitive pressure mechanism—the 'safety lab' itself explicitly acknowledging the structural trade-off. Also provides counter-evidence to alignment-before-scaling claim by demonstrating capability-first pattern even at safety-focused orgs. No new claims extracted; this is pure enrichment of existing theoretical claims with real-world institutional failure data."
---
## Content
@ -40,3 +45,9 @@ This is the highest-profile case of a voluntary AI safety commitment collapsing
PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
WHY ARCHIVED: Strongest possible enrichment evidence for existing claim — the "safety lab" itself rolls back its flagship pledge and explicitly acknowledges competitive pressure as the cause
EXTRACTION HINT: This is an ENRICHMENT source, not a new claim. Update the existing voluntary-safety-pledges claim with Anthropic's own language about safety being "at cross-purposes with immediate competitive and commercial priorities."
## Key Facts
- Anthropic committed to RSP in 2023 requiring pre-training safety guarantees
- Anthropic rolled back RSP in February 2026
- New RSP language explicitly acknowledges safety is 'at cross-purposes with immediate competitive and commercial priorities'