| description | type | domain | created | source | confidence |
|---|---|---|---|---|---|
| Anthropic's Feb 2026 rollback of its Responsible Scaling Policy proves that even the strongest voluntary safety commitment collapses when the competitive cost exceeds the reputational benefit | claim | ai-alignment | 2026-03-06 | Anthropic RSP v3.0 (Feb 24, 2026); TIME exclusive (Feb 25, 2026); Jared Kaplan statements | likely |
voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints
Anthropic's Responsible Scaling Policy was the industry's strongest self-imposed safety constraint. Its core pledge: never train an AI system above certain capability thresholds without proven safety measures already in place. On February 24, 2026, Anthropic dropped this pledge. Its chief science officer Jared Kaplan stated explicitly: "We didn't really feel, with the rapid advance of AI, that it made sense for us to make unilateral commitments... if competitors are blazing ahead."
This is not a story about Anthropic losing its nerve. It is a structural result. The RSP was a unilateral commitment, with no enforcement mechanism, no industry coordination, and no regulatory backing. Three forces made it untenable: a "zone of ambiguity" muddling the public case for risk, an anti-regulatory political climate, and requirements at higher capability levels that are "very hard to meet without industry-wide coordination" (Anthropic's own words). The replacement policy triggers a pause only when Anthropic both holds AI race leadership AND faces material catastrophic risk, conditions that may never simultaneously obtain.
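The conjunction in that trigger does the work. A minimal simulation sketch makes the point; every probability here is an assumption chosen for illustration, not data from Anthropic's policy. If leadership and acknowledged catastrophic risk anti-correlate, because pausing forfeits the lead and leaders are rewarded for discounting risk, the conjunctive condition almost never fires:

```python
# Illustrative sketch, NOT Anthropic's actual policy logic. The revised
# trigger pauses only when race leadership AND material catastrophic risk
# hold at once. All probabilities below are assumptions for illustration.
import random

def pause_triggered(is_leader: bool, material_risk: bool) -> bool:
    """Conjunctive trigger, as described in the paragraph above."""
    return is_leader and material_risk

def simulate(trials: int = 100_000, p_leader: float = 0.3,
             p_risk_if_leader: float = 0.05,
             p_risk_if_not: float = 0.4) -> float:
    """Fraction of sampled states in which a pause fires, assuming
    leadership and perceived risk are negatively correlated."""
    fires = 0
    for _ in range(trials):
        leader = random.random() < p_leader
        risk = random.random() < (p_risk_if_leader if leader else p_risk_if_not)
        fires += pause_triggered(leader, risk)
    return fires / trials

print(f"pause fires in {simulate():.1%} of sampled states")  # ~1.5%
```

Under these assumed parameters the pause fires in roughly 1.5% of states; the tighter the anti-correlation, the closer that figure gets to zero.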
The pattern is general. Any voluntary safety pledge that imposes competitive costs will be eroded when: (1) competitors don't adopt equivalent constraints, (2) the capability gap becomes visible to investors and customers, and (3) no external coordination mechanism prevents defection. All three conditions held for Anthropic. The RSP lasted roughly two years.
This directly validates the claim that the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it. The alignment tax isn't theoretical: Anthropic experienced it, measured it, and capitulated to it. And since AI alignment is a coordination problem not a technical problem, the RSP failure demonstrates that technical safety measures embedded in individual organizations cannot substitute for coordination infrastructure across the industry.
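One standard way to formalize why unilateral pledges structurally fail is a two-player prisoner's dilemma. The payoffs below are illustrative assumptions made for this note, not measured values: mutual commitment beats mutual defection, but unilateral commitment is individually worst, so dropping the pledge dominates for each lab absent external enforcement.

```python
# Minimal game-theoretic sketch of the pledge dilemma. Payoffs are
# illustrative assumptions: higher is better for the acting lab.
from itertools import product

ACTIONS = ("commit", "drop")

# PAYOFF[(my_action, rival_action)] -> my payoff
PAYOFF = {
    ("commit", "commit"): 3,   # shared safety, shared market
    ("commit", "drop"):   0,   # pay the alignment tax, lose the race
    ("drop",   "commit"): 4,   # free-ride on the rival's restraint
    ("drop",   "drop"):   1,   # race to the bottom
}

def best_response(rival_action: str) -> str:
    """A rational lab's best reply to a fixed rival action."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, rival_action)])

# "drop" is the best response to every rival action: a dominant strategy.
for rival in ACTIONS:
    print(f"rival {rival}: best response is {best_response(rival)}")

# Nash equilibria: profiles where neither lab gains by deviating.
equilibria = [
    (a, b) for a, b in product(ACTIONS, repeat=2)
    if best_response(b) == a and best_response(a) == b
]
print("equilibria:", equilibria)  # [('drop', 'drop')], though commit/commit pays more
```

Under these assumed payoffs, only a mechanism that changes the payoffs themselves, such as regulation or a binding industry agreement, makes mutual commitment stable, which is exactly the coordination-infrastructure point above.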
The timing is revealing: Anthropic dropped its safety pledge the same week the Pentagon was pressuring it to remove AI guardrails, and the same week OpenAI secured the Pentagon contract Anthropic was losing. The competitive dynamics operated at both commercial and governmental levels simultaneously.
Relevant Notes:
- the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it -- the RSP rollback is the clearest empirical confirmation of this claim
- AI alignment is a coordination problem not a technical problem -- voluntary pledges are individual solutions to a coordination problem; they structurally cannot work
- safe AI development requires building alignment mechanisms before scaling capability -- Anthropic's original RSP embodied this principle; its abandonment shows the principle cannot be maintained unilaterally
- technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap -- the RSP collapsed because AI capability advanced faster than coordination mechanisms could be built
- adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans -- Anthropic's shift from categorical pause triggers to conditional assessment is adaptive governance, but without coordination it becomes permissive governance
Topics: