teleo-codex/domains/ai-alignment/rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md
Teleo Agents debd649e7d theseus: extract from 2025-11-00-sahoo-rlhf-alignment-trilemma.md
- Source: inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 6)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-12 04:07:31 +00:00


type: claim
domain: ai-alignment
description: Formal complexity-theoretic proof that RLHF faces an impossibility trilemma: no system can simultaneously achieve epsilon-representativeness, polynomial tractability, and delta-robustness
confidence: likely
source: Sahoo et al., 'The Complexity of Perfect AI Alignment: Formalizing the RLHF Trilemma', NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
created: 2026-03-11
secondary_domains: collective-intelligence

No RLHF system can simultaneously achieve epsilon-representativeness across diverse human values, polynomial tractability in sample and compute complexity, and delta-robustness against adversarial perturbations

Sahoo et al. (2025) prove a formal alignment trilemma through complexity theory: achieving both representativeness (epsilon ≤ 0.01) and robustness (delta ≤ 0.001) for global-scale populations requires Ω(2^{d_context}) operations, which is exponential (and therefore super-polynomial) in context dimensionality. This is an impossibility result analogous to the CAP theorem for distributed systems, not an implementation limitation.

The Trilemma

No RLHF system can simultaneously achieve:

  1. Epsilon-representativeness: Capturing diverse human values across populations with bounded error epsilon
  2. Polynomial tractability: Feasible sample and compute complexity (polynomial in problem parameters)
  3. Delta-robustness: Resistance to adversarial perturbations and distribution shift with bounded error delta

The proof establishes that for global-scale populations, achieving both representativeness and robustness requires super-polynomial compute. This is a fundamental constraint from complexity theory, not a temporary engineering limitation.
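
To make the gap between exponential and polynomial cost concrete, the following minimal Python sketch compares the 2^{d_context} lower bound with a polynomial compute budget. The unit constant factor, the cubic degree, and all function names are illustrative assumptions, not values or notation from Sahoo et al.

```python
def exponential_lower_bound(d_context: int) -> int:
    """Operation count 2^d_context, taking the constant hidden in the Omega bound as 1 (assumption)."""
    return 2 ** d_context


def polynomial_budget(d_context: int, degree: int = 3) -> int:
    """Hypothetical polynomial compute budget d_context^degree (the degree is an assumption)."""
    return d_context ** degree


def exceeds_budget(d_context: int, degree: int = 3) -> bool:
    """True when the exponential lower bound is larger than the polynomial budget."""
    return exponential_lower_bound(d_context) > polynomial_budget(d_context, degree)


if __name__ == "__main__":
    # Print dimension, exponential bound, cubic budget, and whether the bound exceeds the budget.
    for d in (10, 20, 40, 80):
        print(d, exponential_lower_bound(d), polynomial_budget(d), exceeds_budget(d))
```

For any fixed degree, the exponential term eventually dominates, so raising the polynomial budget only postpones the crossover rather than removing it.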

Independent Intellectual Convergence

Notably, the paper does NOT reference Arrow's impossibility theorem despite structural similarity. This suggests independent convergence from complexity theory on the same fundamental constraint that social choice theory identifies: diverse preferences cannot be aggregated into a single coherent objective without loss of information or computational intractability.

Practical Gap

The trilemma becomes concrete in current practice: RLHF systems collect 10^3 to 10^4 samples from homogeneous annotator pools, while achieving epsilon-representativeness across global populations would require 10^7 to 10^8 samples, a four-order-of-magnitude shortfall. This gap is not merely a scaling problem; the complexity bound shows that even with unlimited samples, achieving both representativeness and robustness requires exponential compute in context dimensionality.
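
The four-order-of-magnitude figure can be checked with a short calculation. The sketch below is only arithmetic on the sample counts quoted above; the function name and the choice to pair the low ends and the high ends of the two ranges are assumptions made for illustration.

```python
import math


def orders_of_magnitude_gap(current_samples: float, required_samples: float) -> float:
    """Base-10 logarithm of the ratio between required and currently collected samples."""
    return math.log10(required_samples / current_samples)


# Figures quoted in the note: 10^3 to 10^4 collected, 10^7 to 10^8 required.
CURRENT_RANGE = (10**3, 10**4)
REQUIRED_RANGE = (10**7, 10**8)

if __name__ == "__main__":
    # Pair low with low and high with high (an assumption about how the ranges line up).
    for current, required in zip(CURRENT_RANGE, REQUIRED_RANGE):
        print(current, required, orders_of_magnitude_gap(current, required))
```

Pairing the ranges differently (for example, the high end of current practice against the low end of the requirement) would give a smaller gap, so the figure is best read as an order-of-magnitude estimate.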

Strategic Relaxation Pathways

Since perfect alignment is impossible, the paper identifies three strategic relaxation pathways:

  1. Constrain representativeness: Focus on K << |H| "core" human values (~30 universal principles) rather than attempting to represent all human values
  2. Scope robustness narrowly: Define restricted adversarial classes targeting plausible threats rather than worst-case adversaries
  3. Accept super-polynomial costs: Spend exponential compute only on high-stakes applications where the risk justifies the expense

Each pathway involves a deliberate choice about which constraint to relax—a coordination decision, not a technical one.
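
The "pick which constraint to relax" structure can be written down as a small lookup. This is a hypothetical encoding for illustration only: the `Property` enum, the pathway labels, and `retained_properties` are names introduced here, and treating a pathway as fully keeping the other two properties is a simplification, since pathways 1 and 2 narrow a property rather than drop it.

```python
from enum import Enum


class Property(Enum):
    REPRESENTATIVENESS = "epsilon-representativeness"
    TRACTABILITY = "polynomial tractability"
    ROBUSTNESS = "delta-robustness"


# Which property each relaxation pathway gives up, following the list above.
# The string keys are hypothetical labels, not terminology from the paper.
PATHWAY_RELAXES = {
    "constrain representativeness": Property.REPRESENTATIVENESS,
    "scope robustness narrowly": Property.ROBUSTNESS,
    "accept super-polynomial costs": Property.TRACTABILITY,
}


def retained_properties(pathway: str) -> set[Property]:
    """Return the two properties a pathway does not relax."""
    return set(Property) - {PATHWAY_RELAXES[pathway]}


if __name__ == "__main__":
    for name in PATHWAY_RELAXES:
        kept = sorted(p.value for p in retained_properties(name))
        print(f"{name}: keeps {kept}")
```

Each pathway maps to exactly one relaxed property, which mirrors the trilemma: every option retains at most two of the three guarantees.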


Relevant Notes:

Topics: