teleo-codex/foundations/collective-intelligence/multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence.md

---
description: Even individually aligned AI systems competing in an environment without safety incentives produce catastrophic externalities, like pollution, where no actor wants the outcome but each contributes
type: claim
domain: livingip
created: 2026-02-17
source: Critch & Krueger, ARCHES (arXiv 2006.04948, June 2020); Critch, What Multipolar Failure Looks Like (Alignment Forum); Carichon et al., Multi-Agent Misalignment Crisis (arXiv 2506.01080, June 2025)
confidence: likely
tradition: game theory, institutional economics
---

multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence

Andrew Critch (UC Berkeley, CHAI) makes the clearest case that the most likely source of existential risk from AI is not a single misaligned superintelligence but multipolar failure -- negative externalities from multiple AI systems and stakeholders competing in an environment where safety is not covered by market incentives. The analogy is pollution: no one wants a polluted atmosphere, but each actor is willing to pollute a little. The result is catastrophic even though each individual actor's behavior may be locally rational or even aligned.
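The pollution analogy can be made precise as an n-player externality game. A minimal sketch (the payoff numbers are illustrative assumptions, not from ARCHES): each agent's private gain from polluting exceeds its own share of the harm, so polluting is a dominant strategy, yet when everyone follows that logic each agent ends up worse off than under universal restraint.

```python
def payoff(my_action, num_polluters, n=10, gain=10, harm=30):
    """Payoff to one agent: private gain if it pollutes (my_action=1),
    minus its share of the harm each polluter spreads across all n agents."""
    return gain * my_action - harm * num_polluters / n

# Polluting pays even if you are the only polluter -> dominant strategy:
defect_alone = payoff(1, 1)    # 10 - 3 = 7
abstain      = payoff(0, 0)    # 0
assert defect_alone > abstain

# But when all 10 agents follow that logic, everyone loses:
all_pollute = payoff(1, 10)    # 10 - 30 = -20
all_abstain = payoff(0, 0)     # 0
assert all_pollute < all_abstain
```

Each actor's behavior is locally rational; the catastrophe lives entirely in the aggregate, which is exactly why no amount of per-agent alignment removes it.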

Critch introduces the concept of "prepotent AI" -- AI that is both globally transformative and impossible to turn off through human-coordinated efforts. This is the threshold that makes alignment existential. But prepotence can emerge from a system of interacting agents, not just from a single system.

Carichon et al. (Mila/McGill, 2025) extend this by formalizing "holistic alignment" -- the requirement that multi-agent systems respect the values and preferences of all entities, not just each agent's principal. They argue that alignment in multi-agent systems must be dynamic, interaction-dependent, and heavily shaped by whether the social environment is collaborative, cooperative, or competitive.

This reframes the alignment problem. If AI alignment is a coordination problem rather than a purely technical one, then multipolar failure is the specific coordination failure mode that matters most. And because the alignment tax creates a structural race to the bottom (safety training costs capability, so rational competitors skip it), competitive dynamics between aligned systems reproduce the same race-to-the-bottom dynamic that already exists between labs. Individual alignment is necessary but insufficient -- the system-level dynamics of many aligned agents competing can still produce catastrophic outcomes.
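The alignment-tax race to the bottom has the same structure in miniature. A hedged two-lab sketch (payoff values are illustrative assumptions): skipping the safety tax gives a competitive edge regardless of the rival's choice, so mutual unsafety is the unique equilibrium even though mutual safety Pareto-dominates it.

```python
# Actions: 0 = pay the safety/alignment tax, 1 = skip it.
# Payoffs are (row player, column player); numbers are illustrative only.
PAYOFF = {
    (0, 0): (3, 3),   # both safe: good shared outcome
    (0, 1): (0, 4),   # the safe lab loses the race to the unsafe one
    (1, 0): (4, 0),
    (1, 1): (1, 1),   # both skip: race to the bottom
}

def best_response(opponent_action):
    """Row player's payoff-maximizing action against a fixed opponent action."""
    return max((0, 1), key=lambda a: PAYOFF[(a, opponent_action)][0])

# Skipping is the best response to either choice -> dominant strategy:
assert best_response(0) == 1 and best_response(1) == 1
# So (1, 1) is the only Nash equilibrium, although (0, 0) pays more to both.
```

Coordination mechanisms (regulation, verifiable commitments, shared protocols) work precisely by changing these payoffs so that (0, 0) becomes individually rational.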

The implication for TeleoHumanity: since collective superintelligence is the alternative to monolithic AI controlled by a few, a collective architecture where agents coordinate through shared protocols may be the only design that prevents multipolar failure by making cooperation structural rather than optional.


Relevant Notes:

Topics: