teleo-codex/domains/ai-alignment/universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective.md
Teleo Agents a33d5f697f theseus: extract 2 claims from Yamamoto 2026 Arrow formal proof
- What: (1) created the long-missing Arrow's impossibility alignment claim
  that multiple existing claims reference via wiki-link but had no file;
  (2) new mechanisms-domain claim about the formal verification milestone
- Why: Yamamoto (PLOS One 2026) provides a full proof-calculus representation
  of Arrow's theorem — machine-verifiable, revealing global structure of the
  social welfare function. Fixes broken wiki-links across persistent-irreducible-
  disagreement, pluralistic-alignment, and related claims; adds formal
  certification to alignment impossibility arguments
- Connections: depends_on chain from mechanisms claim into ai-alignment claim;
  links to pluralistic-alignment, RLHF/DPO failure, specification-trap,
  democratic-assemblies, formal-verification claims

Pentagon-Agent: Theseus <THESEUS-001>
2026-03-11 11:08:24 +00:00


type: claim
domain: ai-alignment
secondary_domains: mechanisms, collective-intelligence
description: Arrow's theorem proves no aggregation mechanism satisfies Pareto, IIA, and non-dictatorship simultaneously — directly bounding what single-objective AI alignment can achieve.
confidence: likely
source: Arrow (1951); Yamamoto, 'A Full Formal Representation of Arrow's Impossibility Theorem', PLOS One (2026-02-01)
created: 2026-03-11
depends_on: Arrow's impossibility theorem has a full formal machine-verifiable proof upgrading alignment impossibility arguments from mathematical argument to formally certified result
challenged_by:

universal alignment is mathematically impossible because Arrow's impossibility theorem applies to aggregating diverse human preferences into a single coherent objective

Arrow's Impossibility Theorem (1951) proves that no rank-order social welfare function can simultaneously satisfy three conditions when there are at least two voters and at least three options to rank:

  1. Pareto efficiency — if every individual prefers option A over B, the aggregate also prefers A over B
  2. Independence of irrelevant alternatives (IIA) — the social ranking of A vs B depends only on individuals' rankings of A vs B, not on any third option
  3. Non-dictatorship — no single individual's preferences determine the aggregate outcome in all cases

These conditions are jointly inconsistent. Arrow proved this rigorously; Yamamoto (PLOS One, February 2026) completed a full formal representation using proof calculus, making the result machine-verifiable and revealing the global structure of the social welfare function at the theorem's core.
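To make the inconsistency concrete, here is a minimal Python sketch (illustrative only, not Yamamoto's formalization) of the classic Condorcet profile. Pairwise majority voting respects Pareto, IIA, and non-dictatorship, yet on this profile it fails to produce a ranking at all:

```python
from itertools import combinations

# Classic Condorcet profile: three voters ranking three options,
# each ballot listed from most to least preferred.
ballots = [
    ("A", "B", "C"),  # voter 1
    ("B", "C", "A"),  # voter 2
    ("C", "A", "B"),  # voter 3
]

def majority_prefers(x, y, ballots):
    """True if a strict majority of ballots ranks x above y."""
    wins = sum(b.index(x) < b.index(y) for b in ballots)
    return wins > len(ballots) / 2

# Pairwise majority respects Pareto, IIA, and non-dictatorship,
# yet the "aggregate" it produces here is not an ordering at all:
for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y, ballots) else (y, x)
    print(f"{winner} beats {loser}")
# Prints: A beats B, C beats A, B beats C -- a cycle, not a ranking.
```

Arrow's theorem generalizes this observation: any rule that always returns a genuine ranking must give up one of the three conditions instead.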

The alignment connection is direct: training an AI system to represent diverse human preferences — across users, populations, cultures, and time — is structurally a social choice problem. Any method that aggregates preferences into a single "aligned" objective function must violate at least one of Arrow's conditions. The system either ignores unanimous preferences in some cases (Pareto violation), exhibits sensitivity to irrelevant options (IIA violation), or effectively weights one group's preferences above all others (dictatorship). There is no aggregation mechanism that escapes this trilemma.
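One horn of the trilemma can be shown in a few lines. Borda scoring (a hypothetical stand-in here for any scoring-based preference aggregation) satisfies Pareto and non-dictatorship, but dropping a supposedly irrelevant third option flips the aggregate verdict on A vs B, even though no voter changed their relative ranking of A and B:

```python
def borda(ballots):
    """Borda count: with m options, the top rank earns m-1 points, the last earns 0."""
    m = len(ballots[0])
    scores = {}
    for ballot in ballots:
        for pos, option in enumerate(ballot):
            scores[option] = scores.get(option, 0) + (m - 1 - pos)
    return sorted(scores, key=scores.get, reverse=True)

# Five voters; no voter ever changes their relative A-vs-B ranking.
with_c    = [("A", "B", "C")] * 3 + [("B", "C", "A")] * 2
without_c = [("A", "B")] * 3 + [("B", "A")] * 2

print(borda(with_c))     # ['B', 'A', 'C'] -- B beats A
print(borda(without_c))  # ['A', 'B']      -- A beats B: dropping C flipped it
```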

RLHF and DPO are practical examples of this constraint in action: both collapse rater preferences into a single reward signal (an explicit reward model in RLHF, an implicit one in DPO), which necessarily suppresses the diversity of legitimate human values. The training procedure that makes models safer also flattens distributional pluralism; the formal theorem predicts this failure mode.
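As a toy illustration (hypothetical numbers, a minimal sketch of the reward-aggregation step rather than any specific RLHF pipeline): when two rater groups rank candidate responses in opposite orders, any single scalar reward can place at most one group's top choice at its optimum.

```python
import numpy as np

# Hypothetical rater data (toy numbers): two groups score three candidate
# responses on a 0-1 scale, with exactly opposite rankings.
rewards = np.array([
    [1.0, 0.5, 0.0],   # group X's scores for responses 0, 1, 2
    [0.0, 0.5, 1.0],   # group Y's scores for responses 0, 1, 2
])
weights = np.array([0.6, 0.4])   # share of each group in the rater pool

# A single scalar reward per response, as fit by RLHF/DPO-style training,
# must collapse the two rankings into one.
aggregate = weights @ rewards
print(aggregate)           # [0.6 0.5 0.4]
print(aggregate.argmax())  # 0 -- the larger group's favorite wins outright

# Whatever the weights, the argmax can honor at most one group's top
# choice; the other group's ranking is flattened out of the policy.
```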

This impossibility does not mean alignment is hopeless. It means the aggregation framing is wrong. Two viable responses follow: (1) pluralistic alignment — design AI systems that accommodate irreducibly diverse values rather than converging on a single objective; (2) procedural alignment — agree on fair mechanisms for resolving value conflicts rather than trying to specify agreed outcomes in advance.
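A minimal sketch of the pluralistic contrast, continuing the hypothetical toy numbers above: a system that conditions on the rater group rather than aggregating first can serve each group's top choice, at the cost of no longer having one global objective.

```python
import numpy as np

rewards = np.array([
    [1.0, 0.5, 0.0],   # group X's scores for responses 0, 1, 2
    [0.0, 0.5, 1.0],   # group Y's scores for responses 0, 1, 2
])

# Aggregation-first (single objective): one response for everyone.
single = (np.array([0.6, 0.4]) @ rewards).argmax()
print(single)                    # 0 -- group Y's ranking is discarded

# Pluralistic framing: condition on the group instead of aggregating.
per_group = rewards.argmax(axis=1)
print(per_group)                 # [0 2] -- each group gets its top choice
```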

Challenges

The Arrow framing assumes ranked preferences. If human preferences over AI behavior are not transitive or not expressible as rankings, the theorem's premises may not map cleanly; cardinal or probabilistic preference models fall outside its scope. Some alignment researchers argue that deliberative processes can construct legitimate consensus in ways Arrow does not model. Counter: Arrow's theorem applies to any aggregation of rankings that satisfies its structural assumptions; the challenge must show that AI alignment actually escapes those assumptions, not merely that deliberation softens them.

