Teleo Agents debd649e7d theseus: extract from 2025-11-00-sahoo-rlhf-alignment-trilemma.md

- Source: inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 6)

Pentagon-Agent: Theseus <HEADLESS>

2026-03-12 04:07:31 +00:00


type: claim
domain: ai-alignment
description: Formal complexity-theoretic proof that RLHF faces an impossibility trilemma: no system can simultaneously achieve epsilon-representativeness, polynomial tractability, and delta-robustness
confidence: likely
source: Sahoo et al., 'The Complexity of Perfect AI Alignment: Formalizing the RLHF Trilemma', NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
created: 2026-03-11
secondary_domains: collective-intelligence

No RLHF system can simultaneously achieve epsilon-representativeness across diverse human values, polynomial tractability in sample and compute complexity, and delta-robustness against adversarial perturbations

Sahoo et al. (2025) prove a formal alignment trilemma through complexity theory: achieving both representativeness (epsilon ≤ 0.01) and robustness (delta ≤ 0.001) for global-scale populations requires Ω(2^{d_context}) operations, which is exponential, and therefore super-polynomial, in the context dimensionality d_context. This is an impossibility result analogous to the CAP theorem for distributed systems, not an implementation limitation.

The Trilemma

No RLHF system can simultaneously achieve:

- Epsilon-representativeness: Capturing diverse human values across populations with bounded error epsilon
- Polynomial tractability: Feasible sample and compute complexity (polynomial in problem parameters)
- Delta-robustness: Resistance to adversarial perturbations and distribution shift with bounded error delta

The proof establishes that for global-scale populations, achieving both representativeness and robustness requires super-polynomial compute. This is a fundamental constraint from complexity theory, not a temporary engineering limitation.
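To give a feel for why an Ω(2^{d_context}) lower bound rules out polynomial tractability, the following minimal sketch compares exponential growth against a polynomial budget. The cubic degree, the function names, and the sample dimensions are illustrative assumptions, not values from the paper, and the constants hidden by the Ω notation are ignored.

```python
# Illustrative only: compares 2^d against d^k for a hypothetical degree k.
# The degree (3) and all names here are assumptions, not taken from Sahoo et al.


def exponential_ops(d_context: int) -> int:
    """Operation count at the 2^d lower bound, ignoring hidden constants."""
    return 2 ** d_context


def polynomial_ops(d_context: int, degree: int = 3) -> int:
    """Operation budget of a hypothetical polynomial-time method, d^degree."""
    return d_context ** degree


def crossover_dimension(degree: int = 3, start: int = 2) -> int:
    """First dimension d >= start at which 2^d exceeds d^degree."""
    d = start
    while exponential_ops(d) <= polynomial_ops(d, degree):
        d += 1
    return d


if __name__ == "__main__":
    print("crossover for cubic budget:", crossover_dimension(3))
    for d in (10, 20, 40, 80):
        print(d, polynomial_ops(d), exponential_ops(d))
```

Raising the assumed polynomial degree only moves the crossover to a larger dimension; it does not remove it, which is the sense in which the bound is super-polynomial.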

Independent Intellectual Convergence

Notably, the paper does NOT reference Arrow's impossibility theorem despite structural similarity. This suggests independent convergence from complexity theory on the same fundamental constraint that social choice theory identifies: diverse preferences cannot be aggregated into a single coherent objective without loss of information or computational intractability.

Practical Gap

The trilemma becomes concrete in current practice: RLHF systems collect 10^3–10^4 samples from homogeneous annotator pools, while achieving epsilon-representativeness across global populations would require 10^7–10^8 samples—a four-order-of-magnitude shortfall. This gap is not merely a scaling problem; the complexity bound shows that even with unlimited samples, achieving both representativeness and robustness requires exponential compute in context dimensionality.
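The size of the sample shortfall can be checked with a short calculation. This is a sketch of the arithmetic only: the function name is hypothetical, and the inputs are the range endpoints quoted above.

```python
import math

# Arithmetic check of the sample gap quoted in the text.
# The function name is a hypothetical helper, not part of any library.


def orders_of_magnitude_gap(current_samples: float, required_samples: float) -> float:
    """Base-10 logarithm of the ratio between required and current sample counts."""
    return math.log10(required_samples / current_samples)


if __name__ == "__main__":
    # Matching low end to low end and high end to high end gives four orders.
    print(orders_of_magnitude_gap(1e3, 1e7))
    print(orders_of_magnitude_gap(1e4, 1e8))
    # Crossing the endpoints brackets the gap between three and five orders.
    print(orders_of_magnitude_gap(1e4, 1e7))
    print(orders_of_magnitude_gap(1e3, 1e8))
```

Pairing the endpoints like for like reproduces the four-order figure; the cross-paired cases show how far the estimate could move if current practice and the requirement sit at opposite ends of their quoted ranges.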

Strategic Relaxation Pathways

Since perfect alignment is impossible, the paper identifies three strategic relaxation pathways:

- Constrain representativeness: Focus on K << |H| "core" human values (~30 universal principles) rather than attempting to represent all human values
- Scope robustness narrowly: Define restricted adversarial classes targeting plausible threats rather than worst-case adversaries
- Accept super-polynomial costs: Accept exponential compute for high-stakes applications that warrant the expense

Each pathway involves a deliberate choice about which constraint to relax—a coordination decision, not a technical one.

Relevant Notes:

- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values: this paper formalizes the informal claim through complexity theory
- universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective: independent confirmation from a different mathematical tradition
- safe AI development requires building alignment mechanisms before scaling capability: the trilemma shows alignment constraints must be decided before scaling
- AI alignment is a coordination problem not a technical problem: the trilemma reveals that technical perfection is impossible; the problem becomes choosing which constraints to relax

Topics:

domains/ai-alignment/_map
