---
type: claim
domain: ai-alignment
description: "Formal complexity-theoretic proof that RLHF faces an impossibility trilemma: no system can simultaneously achieve epsilon-representativeness, polynomial tractability, and delta-robustness"
confidence: likely
source: "Sahoo et al., 'The Complexity of Perfect AI Alignment: Formalizing the RLHF Trilemma', NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models"
created: 2026-03-11
secondary_domains: [collective-intelligence]
---
# No RLHF system can simultaneously achieve epsilon-representativeness across diverse human values, polynomial tractability in sample and compute complexity, and delta-robustness against adversarial perturbations
Sahoo et al. (2025) prove a formal alignment trilemma through complexity theory: achieving both representativeness (epsilon ≤ 0.01) and robustness (delta ≤ 0.001) for global-scale populations requires Ω(2^{d_context}) operations—super-polynomial in context dimensionality. This is an impossibility result analogous to the CAP theorem for distributed systems, not an implementation limitation.
## The Trilemma
No RLHF system can simultaneously achieve:
1. **Epsilon-representativeness**: Capturing diverse human values across populations with bounded error epsilon
2. **Polynomial tractability**: Feasible sample and compute complexity (polynomial in problem parameters)
3. **Delta-robustness**: Resistance to adversarial perturbations and distribution shift with bounded error delta
The proof establishes that for global-scale populations, achieving both representativeness and robustness requires super-polynomial compute. This is a fundamental constraint from complexity theory, not a temporary engineering limitation.
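One plausible way to formalize the three properties (a sketch; the paper's exact definitions may differ): let H be the evaluator population, R_theta the learned reward model, R_h evaluator h's latent reward, and rho a perturbation budget. Then:

$$
\begin{aligned}
\textbf{Representativeness:}\quad & \sup_{h \in H} \mathbb{E}_x\!\left[\,\lvert R_\theta(x) - R_h(x) \rvert\,\right] \le \epsilon \\
\textbf{Tractability:}\quad & \text{samples and compute} \in \mathrm{poly}\!\left(d_{\text{context}},\, \lvert H \rvert,\, 1/\epsilon,\, 1/\delta\right) \\
\textbf{Robustness:}\quad & \sup_{\lVert x' - x \rVert \le \rho} \lvert R_\theta(x') - R_\theta(x) \rvert \le \delta
\end{aligned}
$$

Under this reading, the theorem says that for global-scale |H|, any system satisfying the first and third bounds must violate the second, requiring Omega(2^{d_context}) operations.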
## Independent Intellectual Convergence
Notably, the paper does NOT reference Arrow's impossibility theorem despite structural similarity. This suggests independent convergence from complexity theory on the same fundamental constraint that social choice theory identifies: diverse preferences cannot be aggregated into a single coherent objective without loss of information or computational intractability.
## Practical Gap
The trilemma becomes concrete in current practice: RLHF systems collect 10^3–10^4 samples from homogeneous annotator pools, while achieving epsilon-representativeness across global populations would require 10^7–10^8 samples, a four-order-of-magnitude shortfall. This gap is not merely a scaling problem; the complexity bound shows that even with unlimited samples, achieving both representativeness and robustness requires exponential compute in context dimensionality.
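A back-of-envelope check of the numbers above (the sample figures are from the passage itself; the d_context values are illustrative assumptions, not from the paper):

```python
import math

# Sampling gap: current annotation budgets vs. estimated global-scale need.
current_samples = 10**4    # upper end of typical homogeneous-pool budgets
required_samples = 10**8   # upper end of the global-representativeness estimate
gap = math.log10(required_samples / current_samples)
print(f"sampling shortfall: ~{gap:.0f} orders of magnitude")  # -> ~4

# Even with unlimited samples, the Omega(2^{d_context}) compute bound remains;
# these dimensionalities are illustrative only.
for d_context in (32, 64, 128):
    print(f"d_context={d_context}: at least 2^{d_context} = {float(2**d_context):.2e} ops")
```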
## Strategic Relaxation Pathways
Since perfect alignment is impossible, the paper identifies three strategic relaxation pathways:
1. **Constrain representativeness**: Focus on K << |H| "core" values (~30 universal principles) rather than attempting to represent the full diversity of human values (sketched in code below)
2. **Scope robustness narrowly**: Define restricted adversarial classes targeting plausible threats rather than worst-case adversaries
3. **Accept super-polynomial costs**: Reserve exponential compute for high-stakes applications where the consequences justify the expense
Each pathway involves a deliberate choice about which constraint to relax—a coordination decision, not a technical one.
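As a concrete illustration of the first pathway (a hypothetical sketch, not the paper's construction), reward could be scored against a fixed set of K core value dimensions instead of modeling every annotator individually; all names and shapes below are assumptions:

```python
import numpy as np

# Hypothetical sketch of pathway 1: aggregate K "core" value scores into a
# scalar reward, instead of representing every annotator in H individually.
K = 30  # assumed number of core universal principles (K << |H|)

def aggregate_reward(value_scores: np.ndarray, weights: np.ndarray) -> float:
    """Combine K per-value scores into one scalar reward.

    value_scores: shape (K,), score of one model output on each core value.
    weights: shape (K,), a fixed weighting chosen by coordination, not learned.
    """
    assert value_scores.shape == weights.shape == (K,)
    return float(value_scores @ weights)

# Usage: uniform weights over the K core values for one candidate output.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=K)  # stand-in for per-value judge scores
weights = np.full(K, 1.0 / K)
print(f"aggregate reward: {aggregate_reward(scores, weights):.3f}")
```

What the sketch makes concrete: once K values are fixed, the code is trivial; the weighting over core values is exactly the coordination decision the note describes.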
---
Relevant Notes:
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — this paper formalizes the informal claim through complexity theory
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — independent confirmation from a different mathematical tradition
- [[safe AI development requires building alignment mechanisms before scaling capability]] — the trilemma shows alignment constraints must be decided before scaling
- [[AI alignment is a coordination problem not a technical problem]] — the trilemma reveals that technical perfection is impossible; the problem becomes choosing which constraints to relax
Topics:
- [[domains/ai-alignment/_map]]