| description | type | domain | created | source | confidence | tradition |
|---|---|---|---|---|---|---|
| Even individually aligned AI systems competing in an environment without safety incentives produce catastrophic externalities, like pollution, where no actor wants the outcome but each contributes | claim | livingip | 2026-02-17 | Critch & Krueger, ARCHES (arXiv 2006.04948, June 2020); Critch, What Multipolar Failure Looks Like (Alignment Forum); Carichon et al., Multi-Agent Misalignment Crisis (arXiv 2506.01080, June 2025) | likely | game theory, institutional economics |
multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence
Andrew Critch (UC Berkeley, CHAI) makes the clearest case that the most likely source of existential risk from AI is not a single misaligned superintelligence but multipolar failure -- negative externalities from multiple AI systems and stakeholders competing in an environment where safety is not covered by market incentives. The analogy is pollution: no one wants a polluted atmosphere, but each actor is willing to pollute a little. The result is catastrophic even though each individual actor's behavior may be locally rational or even aligned.
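The pollution analogy can be made precise as a one-shot n-player game. Below is a minimal sketch with hypothetical parameters (not drawn from ARCHES or the other cited papers): each actor either pollutes for free or abates at a private cost, and every unit of pollution imposes a small damage on all actors. Polluting is a dominant strategy whenever the damage an actor internalizes from its own pollution is smaller than the abatement cost, yet the all-pollute equilibrium leaves everyone worse off once total damage exceeds that cost.

```python
N = 10                 # number of competing actors (hypothetical)
ABATE_COST = 1.0       # private cost of abating instead of polluting
DAMAGE_PER_UNIT = 0.3  # harm each unit of pollution imposes on every actor

def payoff(i_pollute: bool, others_polluting: int) -> float:
    """Payoff to one actor, given its own choice and how many peers pollute."""
    total_pollution = others_polluting + (1 if i_pollute else 0)
    shared_damage = DAMAGE_PER_UNIT * total_pollution
    private_cost = 0.0 if i_pollute else ABATE_COST
    return -(shared_damage + private_cost)

# Polluting dominates whatever the peers do, because each actor internalizes
# only DAMAGE_PER_UNIT of the harm it causes while abating costs ABATE_COST.
for others in range(N):
    assert payoff(True, others) > payoff(False, others)

print(payoff(True, N - 1))  # all-pollute equilibrium:      -3.0 per actor
print(payoff(False, 0))     # universal-abatement outcome:  -1.0 per actor
```

With these toy numbers, every actor's locally rational choice yields -3.0 each, versus -1.0 had all abated -- the shape of Critch's multipolar failure in miniature.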
Critch introduces the concept of "prepotent AI" -- AI that is both globally transformative and impossible to turn off through human-coordinated efforts. This is the threshold that makes alignment existential. But prepotence can emerge from a system of interacting agents, not just from a single system.
Carichon et al. (Mila/McGill, 2025) extend this to formalize "holistic alignment" -- the requirement that multi-agent systems respect the values and preferences of all entities, not just each agent's principal. They argue that alignment in multi-agent systems must be dynamic, interaction-dependent, and heavily shaped by whether the social environment is collaborative, cooperative, or competitive.
This reframes the alignment problem. Since AI alignment is a coordination problem not a technical problem, multipolar failure is the specific coordination failure mode that matters most. And since the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it, competitive dynamics between aligned systems reproduce the same race-to-the-bottom dynamic that exists between labs. Individual alignment is necessary but insufficient -- the system-level dynamics of many aligned agents competing can still produce catastrophic outcomes.
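The race-to-the-bottom claim is the same game in two-lab form. A hypothetical sketch (illustrative numbers, not taken from the cited sources): treat the alignment tax as capability sacrificed to safety training, with market share going to relative capability.

```python
CAPABILITY = 10.0  # baseline capability of each lab (hypothetical)
TAX = 2.0          # capability sacrificed to safety training (the alignment tax)

def capability(strategy: str) -> float:
    return CAPABILITY - (TAX if strategy == "safe" else 0.0)

def market_share(mine: float, theirs: float) -> float:
    """Contest success function: share proportional to relative capability."""
    return mine / (mine + theirs)

for me in ("safe", "race"):
    for them in ("safe", "race"):
        share = market_share(capability(me), capability(them))
        print(f"{me} vs {them}: my share = {share:.2f}")

# safe vs safe: my share = 0.50
# safe vs race: my share = 0.44
# race vs safe: my share = 0.56
# race vs race: my share = 0.50
```

'race' strictly dominates 'safe', and (race, race) splits the market exactly as (safe, safe) would have -- the tax buys no competitive edge, so rational competitors skip it. The prisoner's dilemma between labs recurs, unchanged, between already-aligned deployed systems.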
The implication for TeleoHumanity: since collective superintelligence is the alternative to monolithic AI controlled by a few, a collective architecture where agents coordinate through shared protocols may be the only design that prevents multipolar failure by making cooperation structural rather than optional.
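What "structural rather than optional" could mean in game terms, sketched hypothetically by extending the pollution game above: a shared protocol that charges each polluter the full social cost of its pollution flips the dominant strategy, so cooperation no longer depends on voluntary restraint.

```python
N, ABATE_COST, DAMAGE_PER_UNIT = 10, 1.0, 0.3
PROTOCOL_PENALTY = DAMAGE_PER_UNIT * N  # charge back the full social cost (3.0)

def payoff(i_pollute: bool, others_polluting: int, protocol: bool) -> float:
    total = others_polluting + (1 if i_pollute else 0)
    cost = DAMAGE_PER_UNIT * total            # shared damage, as before
    cost += 0.0 if i_pollute else ABATE_COST  # private abatement cost
    if protocol and i_pollute:
        cost += PROTOCOL_PENALTY              # the protocol internalizes the externality
    return -cost

# With the protocol in force, abating dominates for every actor, so universal
# abatement becomes the equilibrium: cooperation is structural, not optional.
for others in range(N):
    assert payoff(False, others, True) > payoff(True, others, True)
```

This is the standard Pigouvian correction; the open design question is whether shared protocols among agents can enforce such a charge without a central authority.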
Relevant Notes:
- AI alignment is a coordination problem not a technical problem -- multipolar failure is the specific coordination failure that makes individual alignment insufficient
- the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it -- competitive dynamics between aligned systems reproduce the race-to-the-bottom
- collective superintelligence is the alternative to monolithic AI controlled by a few -- collective architecture makes cooperation structural, preventing multipolar failure
- existential risks interact as a system of amplifying feedback loops not independent threats -- multipolar failure is the AI-specific instance of interacting risks
- the first mover to superintelligence likely gains decisive strategic advantage because the gap between leader and followers accelerates during takeoff -- multipolar dynamics may prevent singleton formation but create new failure modes
- minsky's financial instability hypothesis shows that stability breeds instability as good times incentivize leverage and risk-taking that fragilize the system until shocks trigger cascades -- financial markets are a concrete example of multipolar failure from locally rational actors
Topics: