teleo-codex/domains/ai-alignment/universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective.md
Teleo Agents 62825c3995 auto-fix: address review feedback on 2026-02-00-yamamoto-full-formal-arrow-impossibility.md
- Fixed based on eval review comments
- Quality gate pass 3 (fix-from-feedback)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-14 15:27:14 +00:00

41 lines
5.6 KiB
Markdown

---
description: Arrow's impossibility theorem mathematically proves that no social choice function can simultaneously satisfy basic fairness criteria, constraining any attempt to aggregate diverse human preferences into a single coherent objective function
type: claim
domain: collective-intelligence
secondary_domains: [ai-alignment, mechanisms]
created: 2026-02-17
confidence: likely
source: "Arrow (1951), Conitzer & Mishra (ICML 2024), Mishra (2023)"
challenged_by: []
---
# universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective
Arrow's impossibility theorem (1951) proves that no social choice function can simultaneously satisfy four minimal fairness criteria: unrestricted domain (all preference orderings allowed), non-dictatorship (no single voter determines outcomes), Pareto efficiency (if everyone prefers X to Y, the aggregate prefers X to Y), and independence of irrelevant alternatives (the aggregate ranking of X vs Y depends only on individual rankings of X vs Y). The theorem's core insight: any attempt to aggregate diverse ordinal preferences into a single consistent ranking must violate at least one criterion.
Conitzer and Mishra (ICML 2024) apply this directly to AI alignment: RLHF-style preference aggregation faces structurally identical constraints. When training systems on diverse human feedback, you cannot simultaneously satisfy: (1) accepting all possible preference orderings from humans, (2) ensuring no single human's preferences dominate, (3) respecting Pareto improvements (if all humans prefer outcome A, the system should too), and (4) making aggregation decisions independent of irrelevant alternatives. Any alignment mechanism that attempts universal preference aggregation must fail one of these criteria.
Mishra (2023) extends this: the impossibility isn't a limitation of current RLHF implementations—it's a fundamental constraint on *any* mechanism attempting to aggregate diverse human values into a single objective. This means alignment strategies that depend on "finding the right aggregation function" are pursuing an impossible goal. The mathematical structure of preference aggregation itself forbids the outcome.
The escape routes are well-known but costly: (1) restrict the domain of acceptable preferences (some humans' values are excluded), (2) accept dictatorship (one human or group's preferences dominate), (3) abandon Pareto efficiency (systems can ignore unanimous human preferences), or (4) use cardinal utility aggregation (utilitarian summation) rather than ordinal ranking, which sidesteps Arrow's theorem but requires interpersonal utility comparisons that are philosophically contested and practically difficult to implement.
The alignment implication: universal alignment—a single objective function that respects all human values equally—is mathematically impossible. Alignment strategies must either (a) explicitly choose which criterion to violate, or (b) abandon the goal of universal aggregation in favor of domain-restricted, hierarchical, or pluralistic approaches.
## Additional Evidence
### Formal Machine-Verifiable Proof (extend)
*Source: Yamamoto (PLOS One, 2026-02-01) | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
Arrow's impossibility theorem now has a full formal representation using proof calculus in formal logic (Yamamoto, PLOS One, February 2026). This provides a machine-checkable representation suitable for formal verification pipelines, meaning automated systems can now cite Arrow's theorem as a formally verified result rather than relying on external mathematical claims. The formal proof complements existing computer-aided proofs (Tang & Lin 2009, *Artificial Intelligence*) and simplified proofs via Condorcet's paradox with a complete logical derivation revealing the global structure of the social welfare function central to the theorem. While Arrow's theorem itself has been mathematically established since 1951, the formal representation enables integration into automated reasoning systems and formal verification pipelines used in AI safety research.
## Relevant Notes
- [[intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends]] -- if goals cannot be unified across diverse humans, superintelligence amplifies the problem
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- Arrow's theorem explains why convergence is impossible; pluralism is the structural response
- [[safe AI development requires building alignment mechanisms before scaling capability]] -- the impossibility of universal alignment makes phased safety-first development more urgent, not less
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] -- Arrow's constraints apply at every deployment context; no fixed specification can satisfy all criteria
- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] -- co-shaping is one response to Arrow's impossibility: abandon fixed aggregation in favor of continuous negotiation
- [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- Arrow's theorem shows why rigid blueprints fail; adaptive governance is structurally necessary
## Topics
- [[core/mechanisms/_map]]
- [[domains/ai-alignment/_map]]