| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | The Complexity of Perfect AI Alignment: Formalizing the RLHF Trilemma | Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary | https://arxiv.org/abs/2511.19504 | 2025-11-01 | ai-alignment | | paper | unprocessed | high | |
Content
Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford, and Northeastern. Presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models.
The Alignment Trilemma: No RLHF system can simultaneously achieve:
- Epsilon-representativeness across diverse human values
- Polynomial tractability in sample and compute complexity
- Delta-robustness against adversarial perturbations and distribution shift
Core complexity bound: Achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations — super-polynomial in context dimensionality.
Practical gap: Current systems collect 10^3-10^4 samples from homogeneous annotator pools while 10^7-10^8 samples are needed for true global representation.
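The magnitudes above can be checked back-of-envelope. A minimal sketch, using the note's own figures; the `required_operations` function below is an assumed illustrative stand-in for the paper's Omega(2^{d_context}) lower bound, not its exact model:

```python
# Toy illustration of the representativeness gap and the exponential bound.
# required_operations is an assumed stand-in for the Omega(2^d) lower bound.

def required_operations(d_context: int) -> int:
    """Assumed 2^d scaling: cost grows exponentially in context dimensionality."""
    return 2 ** d_context

current_pool = 10_000          # upper end of today's 10^3-10^4 annotations
global_target = 100_000_000    # upper end of the 10^7-10^8 estimate

gap = global_target // current_pool
print(f"representation gap: {gap:,}x")  # → 10,000x

# Even a modest context dimensionality makes the lower bound astronomical:
print(f"operations at d_context=40: {required_operations(40):,}")
```

The point of the sketch: the 10,000x annotation shortfall is the easy part; the exponential term is what makes closing it intractable rather than merely expensive.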
Documented RLHF pathologies (computational necessities, not implementation bugs):
- Preference collapse: Single-reward RLHF cannot capture multimodal preferences even in theory
- Sycophancy: RLHF-trained assistants sacrifice truthfulness to agree with false user beliefs
- Bias amplification: Models assign >99% probability to majority opinions, functionally erasing minority perspectives
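Preference collapse and bias amplification can be seen in a few lines. A minimal sketch, not the paper's construction: the Gaussian preference shapes, the 90/10 population split, and the softmax temperature below are all illustrative assumptions. It shows a single scalar reward fit to two opposed subpopulations collapsing onto the majority mode, with the minority mode receiving vanishing probability mass:

```python
import numpy as np

# Two subpopulations with opposed preferences over a 1-D output space.
# Shapes, weights, and temperature are hypothetical, for illustration only.
xs = np.linspace(0.0, 1.0, 101)

def preference(x, mu, sigma=0.1):
    """Assumed Gaussian-shaped preference peaked at mu."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# 90% of annotators prefer outputs near 0.2, 10% near 0.8.
single_reward = 0.9 * preference(xs, 0.2) + 0.1 * preference(xs, 0.8)

best = xs[np.argmax(single_reward)]
print(f"argmax of pooled reward: {best:.2f}")  # → 0.20, the majority mode

# Policy as a softmax over the pooled reward: minority mode nearly erased.
probs = np.exp(10 * single_reward) / np.exp(10 * single_reward).sum()
minority_mass = probs[xs >= 0.5].sum()
print(f"probability mass near minority mode: {minority_mass:.4f}")
```

The averaged reward is maximized exactly at the majority's preferred output, and the softmax policy assigns well under 1% of its mass to the minority side, mirroring the ">99% probability to majority opinions" pathology described above.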
Strategic relaxation pathways:
- Constrain representativeness: Focus on K << |H| "core" human values (~30 universal principles)
- Scope robustness narrowly: Define restricted adversarial class targeting plausible threats
- Accept super-polynomial costs: Justify exponential compute for high-stakes applications
Agent Notes
Why this matters: This is the formal impossibility result our KB has been gesturing at. Our claim "RLHF and DPO both fail at preference diversity" is an informal version of this trilemma. The formal result is stronger — it's not just that current implementations fail, it's that NO RLHF system can simultaneously achieve all three properties. This is analogous to the CAP theorem for distributed systems.
What surprised me: The paper does NOT directly reference Arrow's theorem despite the structural similarity. The trilemma is proven through complexity theory rather than social choice theory. This is an independent intellectual tradition arriving at a compatible impossibility result — strong convergent evidence.
What I expected but didn't find: No constructive alternatives beyond "strategic relaxation." The paper diagnoses but doesn't prescribe. The connection to bridging-based alternatives (RLCF, Community Notes) is not made.
KB connections:
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values — this paper FORMALIZES our existing claim
- universal alignment is mathematically impossible because Arrow's impossibility theorem applies to aggregating diverse human preferences into a single coherent objective — independent confirmation from complexity theory
- scalable oversight degrades rapidly as capability gaps grow — the trilemma shows degradation is mathematically necessary
Extraction hints: Claims about (1) the formal alignment trilemma as impossibility result, (2) preference collapse / sycophancy / bias amplification as computational necessities, (3) the 10^3 vs 10^8 representation gap in current RLHF.
Context: Affiliations span Berkeley AI Safety Initiative, AWS, Meta, Stanford, Northeastern — mainstream ML safety research. NeurIPS workshop venue gives it peer scrutiny.
Curator Notes (structured handoff for extractor)
- PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
- WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of the Arrow's-theorem-based argument from a different mathematical tradition
- EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing