---
type: claim
domain: ai-alignment
description: "Formal complexity-theoretic proof that RLHF faces an impossibility trilemma: no system can simultaneously achieve epsilon-representativeness, polynomial tractability, and delta-robustness"
confidence: likely
source: "Sahoo et al., 'The Complexity of Perfect AI Alignment: Formalizing the RLHF Trilemma', NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models"
created: 2026-03-11
secondary_domains: [collective-intelligence]
---
# No RLHF system can simultaneously achieve epsilon-representativeness across diverse human values, polynomial tractability in sample and compute complexity, and delta-robustness against adversarial perturbations
Sahoo et al. (2025) prove a formal alignment trilemma through complexity theory: achieving both representativeness (epsilon ≤ 0.01) and robustness (delta ≤ 0.001) for global-scale populations requires Ω(2^{d_context}) operations—super-polynomial in context dimensionality. This is an impossibility result analogous to the CAP theorem for distributed systems, not an implementation limitation.
## The Trilemma
No RLHF system can simultaneously achieve:
1. **Epsilon-representativeness**: Capturing diverse human values across populations with bounded error epsilon
2. **Polynomial tractability**: Feasible sample and compute complexity (polynomial in problem parameters)
3. **Delta-robustness**: Resistance to adversarial perturbations and distribution shift with bounded error delta

The proof establishes that for global-scale populations, achieving both representativeness and robustness requires super-polynomial compute. This is a fundamental constraint from complexity theory, not a temporary engineering limitation.
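
A compact way to state the result (the notation here is a sketch of mine, not the paper's exact formalism; the thresholds epsilon ≤ 0.01 and delta ≤ 0.001 are the ones quoted above):

```latex
% Hedged sketch of the trilemma; symbols are illustrative, not Sahoo et al.'s notation.
% A = an RLHF alignment procedure over a context space of dimension d_context,
% H = the set of human value profiles, \mathcal{A} = the adversary class.
\begin{aligned}
\text{Rep}_{\epsilon}(A) &:\; \sup_{h \in H} \operatorname{err}_{\text{value}}(A, h) \le \epsilon
  && \text{(epsilon-representativeness)} \\
\text{Poly}(A) &:\; \operatorname{SampleCost}(A) + \operatorname{ComputeCost}(A) \in \operatorname{poly}(d_{\text{context}}, |H|)
  && \text{(polynomial tractability)} \\
\text{Rob}_{\delta}(A) &:\; \sup_{a \in \mathcal{A}} \operatorname{err}_{\text{adv}}(A, a) \le \delta
  && \text{(delta-robustness)} \\[4pt]
\text{Claim} &:\; \text{Rep}_{0.01}(A) \wedge \text{Rob}_{0.001}(A)
  \;\Longrightarrow\; \operatorname{ComputeCost}(A) = \Omega\!\bigl(2^{\,d_{\text{context}}}\bigr),
  && \text{so } \text{Poly}(A) \text{ fails for global-scale } H.
\end{aligned}
```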
## Independent Intellectual Convergence
Notably, the paper does NOT reference Arrow's impossibility theorem despite structural similarity. This suggests independent convergence from complexity theory on the same fundamental constraint that social choice theory identifies: diverse preferences cannot be aggregated into a single coherent objective without loss of information or computational intractability.
## Practical Gap
The trilemma becomes concrete in current practice: RLHF systems collect 10^3–10^4 samples from homogeneous annotator pools, while achieving epsilon-representativeness across global populations would require 10^7–10^8 samples—a four-order-of-magnitude shortfall. This gap is not merely a scaling problem; the complexity bound shows that even with unlimited samples, achieving both representativeness and robustness requires exponential compute in context dimensionality.
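
A back-of-the-envelope check of those numbers (the sample counts restate the paragraph above; the context dimensionalities are illustrative assumptions, not values from the paper):

```python
import math

# Sample counts quoted in the note (not re-derived here).
current_samples = (10**3, 10**4)    # typical homogeneous annotator pools
required_samples = (10**7, 10**8)   # estimated need for global epsilon-representativeness

gap_orders = math.log10(required_samples[0]) - math.log10(current_samples[0])
print(f"shortfall: ~{gap_orders:.0f} orders of magnitude")   # ~4

# Even with unlimited samples, the Omega(2^d_context) lower bound dominates.
# The d_context values below are purely illustrative.
for d_context in (20, 40, 60):
    print(f"d_context = {d_context}: >= {2**d_context:.3e} operations")
```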
## Strategic Relaxation Pathways
Since perfect alignment is impossible, the paper identifies three strategic relaxation pathways:
1. **Constrain representativeness**: Focus on K << |H| "core" human values (~30 universal principles) rather than attempting to represent all human values
2. **Scope robustness narrowly**: Define restricted adversarial classes targeting plausible threats rather than worst-case adversaries
3. **Accept super-polynomial costs**: Pay for exponential compute in high-stakes applications where the outcome justifies the expense

Each pathway involves a deliberate choice about which constraint to relax—a coordination decision, not a technical one.
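
Read against the sketch above, each pathway amounts to weakening one term of the formal statement (again my notation, not the paper's):

```latex
% Each relaxation targets one leg of the trilemma; notation follows the earlier sketch.
\begin{aligned}
\textbf{1. Constrain representativeness:} &\quad H \;\to\; H_K \subset H,\ |H_K| = K \ll |H|
  \quad (\text{roughly 30 core values}) \\
\textbf{2. Scope robustness narrowly:} &\quad \mathcal{A} \;\to\; \mathcal{A}' \subsetneq \mathcal{A}
  \quad (\text{plausible threats, not worst case}) \\
\textbf{3. Accept super-polynomial cost:} &\quad \operatorname{ComputeCost}(A) = \Omega\!\bigl(2^{\,d_{\text{context}}}\bigr)
  \ \text{tolerated for high-stakes uses}
\end{aligned}
```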
---
Relevant Notes:
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — this paper formalizes the informal claim through complexity theory
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — independent confirmation from a different mathematical tradition
- [[safe AI development requires building alignment mechanisms before scaling capability]] — the trilemma shows alignment constraints must be decided before scaling
- [[AI alignment is a coordination problem not a technical problem]] — the trilemma reveals that technical perfection is impossible; the problem becomes choosing which constraints to relax

Topics:
- [[domains/ai-alignment/_map]]