---
type: claim
domain: ai-alignment
description: "Formal impossibility result showing single reward models fail when human preferences are diverse across subpopulations"
confidence: likely
source: "Chakraborty et al., MaxMin-RLHF: Alignment with Diverse Human Preferences (ICML 2024)"
created: 2026-03-11
---

# Single-reward RLHF cannot align diverse preferences because the alignment gap grows with minority distinctiveness and inversely with minority representation

Chakraborty et al. (2024) prove a formal impossibility result: when human preferences are diverse across subpopulations, a single reward model in RLHF cannot adequately align language models. The alignment gap (the difference between the optimal alignment achievable for each group and what a single reward achieves) grows in proportion to how distinct the minority's preferences are and inversely with its representation in the training data.

This is demonstrated empirically at two scales:

**GPT-2 scale:** Single-reward RLHF optimized for positive sentiment (the majority preference) while completely ignoring conciseness (the minority preference). The model satisfied the majority but failed the minority entirely.

**Tulu2-7B scale:** At a 10:1 majority:minority preference ratio, the single reward model's accuracy on the minority group dropped from 70.4% (balanced case) to 42%. This 28.4-percentage-point degradation exposes the structural failure mode.

The impossibility is structural, not a matter of insufficient training data or model capacity. A single reward function mathematically cannot capture context-dependent values that vary across identifiable subpopulations.

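A minimal sketch of the mechanism, with hypothetical numbers (not the paper's construction): an entropy-regularized policy optimized against a single population-averaged reward over two candidate responses. As the minority's population share shrinks, the averaged reward tilts toward the majority's favorite and the minority's alignment gap grows, regardless of how much data is collected.

```python
import math

def minority_gap(share, beta=0.2):
    """Minority alignment gap when one policy optimizes an averaged reward.

    `share` is the minority's fraction of the population; `beta` is an
    assumed entropy-regularization temperature."""
    majority = {"A": 1.0, "B": 0.0}   # e.g. prefers positive sentiment
    minority = {"A": 0.0, "B": 1.0}   # e.g. prefers conciseness
    # Single reward = population-weighted average of the group rewards.
    avg = {r: (1 - share) * majority[r] + share * minority[r] for r in "AB"}
    # Entropy-regularized optimum: pi(r) is proportional to exp(avg(r)/beta).
    z = sum(math.exp(v / beta) for v in avg.values())
    pi = {r: math.exp(v / beta) / z for r, v in avg.items()}
    # Gap = minority's best achievable utility (1.0) minus realized utility.
    return 1.0 - sum(pi[r] * minority[r] for r in "AB")

for share in (0.5, 0.3, 0.09):        # 0.09 roughly matches the 10:1 ratio
    print(f"minority share {share:.2f}: gap {minority_gap(share):.2f}")
```

The gap is driven by the population split and the distinctness of the group rewards, not by sample size, which is the structural point of the impossibility result.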
## Evidence

Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignment with Diverse Human Preferences." ICML 2024. https://arxiv.org/abs/2402.08925

- Formal proof that high subpopulation diversity leads to a greater alignment gap
- GPT-2 experiment: single-reward RLHF achieved positive sentiment but ignored conciseness
- Tulu2-7B experiment: minority-group accuracy dropped from 70.4% to 42% at a 10:1 ratio

### Additional Evidence (confirm)

*Source: 2025-11-00-operationalizing-pluralistic-values-llm-alignment | Added: 2026-03-15*

Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.

### Additional Evidence (extend)

*Source: 2026-02-00-an-differentiable-social-choice | Added: 2026-03-16*

An & Du's survey reveals the mechanism behind single-reward failure: RLHF is doing social choice (preference aggregation) but treats it as an engineering detail rather than a normative design choice. The aggregation function is therefore chosen implicitly, without examining which fairness criteria it satisfies.

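A toy illustration of that point, with hypothetical reward values: utilitarian averaging (what fitting a single reward model implicitly does) and egalitarian max-min aggregation (the alternative MaxMin-RLHF's title names) can select different responses from the same per-group rewards, so the aggregation rule is a substantive normative commitment.

```python
# Per-response rewards for [majority, minority]; numbers are made up.
group_rewards = {
    "A": [0.9, 0.1],        # great for the majority, bad for the minority
    "B": [0.6, 0.6],        # acceptable for everyone
}
weights = [0.8, 0.2]        # assumed 80/20 population split

# Utilitarian aggregation: population-weighted mean reward.
mean_agg = {r: sum(w * v for w, v in zip(weights, vals))
            for r, vals in group_rewards.items()}
# Egalitarian (max-min) aggregation: reward of the worst-off group.
maxmin_agg = {r: min(vals) for r, vals in group_rewards.items()}

print(max(mean_agg, key=mean_agg.get))      # "A": averaging favors the majority
print(max(maxmin_agg, key=maxmin_agg.get))  # "B": max-min protects the worst-off
```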
### Additional Evidence (extend)

*Source: 2025-00-00-em-dpo-heterogeneous-preferences | Added: 2026-03-16*

EM-DPO provides a formal proof that binary comparisons are mathematically insufficient for identifying preference types, explaining *why* single-reward RLHF fails: the training-signal format cannot contain the information needed to discover heterogeneity, regardless of dataset size. Rankings over three or more responses are necessary.

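The insufficiency of binary comparisons can be seen in a small constructed example (mine, not EM-DPO's): two different 50/50 mixtures of deterministic preference types over three responses produce identical pairwise-comparison statistics yet completely disjoint distributions over full rankings, so no amount of binary data can tell them apart.

```python
from itertools import combinations

# Each mixture is a uniform 50/50 blend of two deterministic rankings
# over responses a, b, c (hypothetical types for illustration).
mixture_1 = [("a", "b", "c"), ("c", "b", "a")]
mixture_2 = [("a", "c", "b"), ("b", "c", "a")]

def pairwise_marginals(mixture):
    # P(x is preferred to y) for each unordered pair under the mixture.
    return {(x, y): sum(r.index(x) < r.index(y) for r in mixture) / len(mixture)
            for x, y in combinations("abc", 2)}

# Identical binary-comparison statistics (all pairs at 0.5) ...
print(pairwise_marginals(mixture_1) == pairwise_marginals(mixture_2))  # True
# ... but the two mixtures share no ranking over 3 responses at all.
print(set(mixture_1) & set(mixture_2))  # set()
```

Length-3 rankings separate the mixtures immediately, matching the note's point that rankings over 3+ responses carry information binary comparisons cannot.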
### Additional Evidence (confirm)

*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*

Formal proof that preference collapse is theoretically inevitable: single-reward RLHF cannot capture multimodal preferences even in principle. The paper also quantifies the practical gap: current systems use 10^3-10^4 preference samples from homogeneous annotator pools, while 10^7-10^8 samples would be needed for global representation, a three-to-four order-of-magnitude shortfall that explains why minority alignment gaps grow with distinctiveness.

---

Relevant Notes:

- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]

Topics:

- domains/ai-alignment/_map