teleo-codex/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md

---
description: "Anthropic's Feb 2026 rollback of its Responsible Scaling Policy proves that even the strongest voluntary safety commitment collapses when the competitive cost exceeds the reputational benefit"
type: claim
domain: ai-alignment
created: 2026-03-06
source: Anthropic RSP v3.0 (Feb 24, 2026); TIME exclusive (Feb 25, 2026); Jared Kaplan statements
confidence: likely
---

voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints

Anthropic's Responsible Scaling Policy was the industry's strongest self-imposed safety constraint. Its core pledge: never train an AI system above certain capability thresholds without proven safety measures already in place. On February 24, 2026, Anthropic dropped this pledge. Its chief science officer, Jared Kaplan, stated explicitly: "We didn't really feel, with the rapid advance of AI, that it made sense for us to make unilateral commitments... if competitors are blazing ahead."

This is not a story about Anthropic losing its nerve. It is a structural result. The RSP was a unilateral commitment — no enforcement mechanism, no industry coordination, no regulatory backing. Three forces made it untenable: a "zone of ambiguity" muddling the public case for risk, an anti-regulatory political climate, and requirements at higher capability levels that are "very hard to meet without industry-wide coordination" (Anthropic's own words). The replacement policy only triggers a pause when Anthropic holds both AI race leadership AND faces material catastrophic risk — conditions that may never simultaneously obtain.
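The "may never simultaneously obtain" point is just a conjunction over a small state space, which a trivial truth-table sketch makes concrete (my illustration only; `pause_triggered` and its arguments are hypothetical names, not the policy's actual criteria):

```python
# Toy sketch (my construction, not Anthropic's policy text): a pause that
# requires BOTH race leadership AND material catastrophic risk is inert in
# three of the four possible states of the world.
from itertools import product

def pause_triggered(holds_race_lead: bool, faces_material_risk: bool) -> bool:
    """Pause fires only on the conjunction of both conditions."""
    return holds_race_lead and faces_material_risk

# Enumerate all four states; the pause fires in exactly one.
for lead, risk in product([False, True], repeat=2):
    status = "PAUSE" if pause_triggered(lead, risk) else "continue"
    print(f"lead={lead!s:5} risk={risk!s:5} -> {status}")
```

If leadership and acute risk anti-correlate, as the note suggests, even that single triggering state is rarely visited.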

The pattern is general. Any voluntary safety pledge that imposes competitive costs will be eroded when: (1) competitors don't adopt equivalent constraints, (2) the capability gap becomes visible to investors and customers, and (3) no external coordination mechanism prevents defection. All three conditions held for Anthropic. The RSP lasted roughly two years.
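The erosion pattern rests on a compounding-gap argument that a toy model can make concrete (entirely my illustration; the 10% tax figure and the linear-progress assumption are arbitrary, not from the source):

```python
# Toy capability-race model (my illustration, not from the source): a safety
# pledge imposes a per-round "alignment tax" on the pledged lab that an
# unconstrained rival skips, so the gap grows every round.

def race(rounds: int, tax: float) -> tuple[float, float]:
    """Return (pledged_lab, rival) capability after `rounds` of progress.

    Each round both labs gain 1.0 capability; the pledged lab diverts
    `tax` of its gain into safety work the rival does not do.
    """
    pledged, rival = 0.0, 0.0
    for _ in range(rounds):
        pledged += 1.0 - tax
        rival += 1.0
    return pledged, rival

# Even a modest 10% tax compounds into a gap investors and customers can see.
pledged, rival = race(rounds=20, tax=0.10)
print(f"pledged lab: {pledged:.1f}, rival: {rival:.1f}, gap: {rival - pledged:.1f}")
# pledged lab: 18.0, rival: 20.0, gap: 2.0
```

The model is deliberately crude: it captures condition (2), the visible capability gap, and shows why conditions (1) and (3) make holding the pledge irrational over time.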

This directly validates [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]]. The alignment tax isn't theoretical: Anthropic experienced it, measured it, and capitulated to it. And since [[AI alignment is a coordination problem not a technical problem]], the RSP failure demonstrates that technical safety measures embedded in individual organizations cannot substitute for coordination infrastructure across the industry.

The timing is revealing: Anthropic dropped its safety pledge the same week the Pentagon was pressuring it to remove AI guardrails, and the same week OpenAI secured the Pentagon contract Anthropic was losing. The competitive dynamics operated at both commercial and governmental levels simultaneously.


Relevant Notes:

Topics: