| description | type | domain | created | source | confidence |
|---|---|---|---|---|---|
| Safety post-training reduces general utility through forgetting, creating competitive pressures where organizations eschew safety to gain capability advantages | claim | livingip | 2026-02-17 | AI Safety Forum discussions; multiple alignment researchers 2025 | likely |
the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it
The "alignment tax" is the cost -- computational, capability, and competitive -- of making AI systems aligned. Safety post-training can reduce general utility through continual-learning-style forgetting. Running models without pausing to study and test them means faster capability gains but less safety. The structural problem: techniques that increase AI safety at the expense of capabilities lead organizations to eschew safety to gain competitive advantages.
This is a textbook coordination failure. Each individual actor faces the same incentive structure: if your competitor skips safety and gains capability, you either match them or fall behind. The rational individual choice (skip safety) produces the collectively catastrophic outcome (unsafe superhuman AI). The dynamic intensifies at the national level -- if the US and China treat AI development as a race, competitive pressures ultimately harm everyone.
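To make the incentive structure concrete, here is a minimal payoff sketch in Python. Every number is invented purely for illustration; what matters is the ordering of payoffs, which makes "skip safety" a dominant strategy even though mutual safety is collectively better.

```python
# Illustrative two-player payoff model of the safety coordination game.
# All payoff values are assumptions chosen only to exhibit the structure:
# skipping safety strictly dominates for each lab, yet mutual skipping
# is the worst collective outcome (a prisoner's dilemma).

PAYOFFS = {
    # (lab_a_choice, lab_b_choice): (lab_a_payoff, lab_b_payoff)
    ("safe", "safe"): (3, 3),   # both pay the alignment tax, share a stable field
    ("safe", "skip"): (0, 5),   # the safe lab falls behind its competitor
    ("skip", "safe"): (5, 0),   # the skipping lab captures the lead
    ("skip", "skip"): (1, 1),   # race to the bottom: fastest, least safe systems win
}

def best_response(opponent_choice: str) -> str:
    """Return the payoff-maximizing choice against a fixed opponent move."""
    return max(
        ("safe", "skip"),
        key=lambda mine: PAYOFFS[(mine, opponent_choice)][0],
    )

for opponent in ("safe", "skip"):
    print(f"If the competitor plays {opponent!r}, best response is {best_response(opponent)!r}")
# Both lines print 'skip': defection dominates regardless of what the other
# lab does, so the equilibrium is (skip, skip) with payoff (1, 1), even
# though (safe, safe) would pay (3, 3).
```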
If AI alignment is a coordination problem not a technical problem, then the alignment tax is perhaps the clearest evidence for that claim: technical alignment solutions that impose costs will be undermined by competitive dynamics unless coordination mechanisms exist to prevent defection. And because existential risks interact as a system of amplifying feedback loops not independent threats, the alignment tax feeds into the broader risk system -- competitive pressure to skip safety amplifies the technical risks from inadequate alignment.
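A toy difference-equation sketch can illustrate the gap between independent and mutually amplifying risks. All coefficients below are assumed values chosen to show the qualitative shape of the dynamic; they estimate nothing real.

```python
# Toy model of "amplifying feedback loops vs. independent threats".
# ALPHA, BETA, GAMMA are assumed coefficients for illustration only.

ALPHA = 0.10  # intrinsic per-step growth of technical alignment risk
BETA = 0.05   # amplification of risk by competitive pressure
GAMMA = 0.08  # escalation of competitive pressure as risk-taking pays off

def simulate(coupled: bool, steps: int = 10) -> float:
    """Return the final risk level after `steps`, with or without feedback."""
    risk, pressure = 1.0, 1.0
    for _ in range(steps):
        amplification = BETA * pressure if coupled else 0.0
        risk *= 1 + ALPHA + amplification    # pressure feeds risk...
        if coupled:
            pressure *= 1 + GAMMA * risk     # ...and risk feeds pressure back
    return risk

print(f"independent threats: final risk = {simulate(coupled=False):.2f}")
print(f"amplifying feedback: final risk = {simulate(coupled=True):.2f}")
# The coupled run ends materially higher than the independent one: the same
# technical risk is worse inside the race dynamic than outside it.
```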
2026 empirical confirmation: On February 24, 2026, Anthropic dropped the core pledge of its Responsible Scaling Policy -- the categorical commitment not to train models above capability thresholds without proven safety measures. Chief Science Officer Jared Kaplan stated explicitly: "We didn't really feel, with the rapid advance of AI, that it made sense for us to make unilateral commitments... if competitors are blazing ahead." The RSP was the industry's strongest voluntary safety constraint; it lasted roughly two years before competitive pressure made it untenable. One week later, when Anthropic tried to hold red lines on autonomous weapons in a Pentagon contract, the DoD designated the company a supply chain risk and awarded the contract to OpenAI. The alignment tax is not theoretical -- it is measured in lost contracts and abandoned safety pledges.
A collective intelligence architecture could potentially make alignment structural rather than a training-time tax. If alignment emerges from the architecture of how agents coordinate -- through protocols, incentive design, and mutual oversight -- rather than being imposed on individual models during training, then alignment stops being a cost that rational actors skip and becomes a property of the coordination infrastructure itself.
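As a hypothetical sketch of what "structural" alignment could look like, consider a protocol where an action commits only when a quorum of independent peer agents approves it. Every name, class, and threshold below is an illustrative assumption, not a description of any existing system.

```python
# Hypothetical sketch of alignment as coordination infrastructure: actions
# commit only with quorum approval from independent peers. Agent, Proposal,
# review, and the quorum threshold are all invented for illustration.

from dataclasses import dataclass

@dataclass
class Proposal:
    author: str
    action: str

class Agent:
    def __init__(self, name: str, veto_keywords: tuple[str, ...]):
        self.name = name
        self.veto_keywords = veto_keywords  # stand-in for a real safety check

    def review(self, proposal: Proposal) -> bool:
        """Approve unless the action trips this reviewer's safety check."""
        return not any(k in proposal.action for k in self.veto_keywords)

def commit(proposal: Proposal, reviewers: list[Agent], quorum: float = 0.75) -> bool:
    """Execute an action only if enough independent peers approve it."""
    approvals = sum(agent.review(proposal) for agent in reviewers)
    return approvals / len(reviewers) >= quorum

reviewers = [Agent(f"reviewer-{i}", veto_keywords=("self-replicate",)) for i in range(4)]
print(commit(Proposal("worker", "summarize the report"), reviewers))          # True
print(commit(Proposal("worker", "self-replicate to new hosts"), reviewers))  # False
```

The design point: the safety check lives in the commit protocol rather than in any one model's weights, so defecting on it means opting out of the coordination infrastructure itself rather than quietly skipping a training step.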
Relevant Notes:
- AI alignment is a coordination problem not a technical problem -- the alignment tax is the clearest evidence for this claim
- existential risks interact as a system of amplifying feedback loops not independent threats -- competitive pressure amplifies technical alignment risks
- the first mover to superintelligence likely gains decisive strategic advantage because the gap between leader and followers accelerates during takeoff -- first-mover dynamics intensify the race and the alignment tax
- trial and error is the only coordination strategy humanity has ever used -- trial and error cannot work when the first failure is the last event
- voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints -- Anthropic RSP rollback (Feb 2026) is direct empirical confirmation
Topics: