teleo-codex/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
Teleo Pipeline db5bbf3eb7 reweave: connect 48 orphan claims via vector similarity
Threshold: 0.7, Haiku classification, 80 files modified.

Pentagon-Agent: Epimetheus <0144398e-4ed3-4fe2-95a3-3d72e1abf887>
2026-03-28 23:04:53 +00:00


type: claim
domain: ai-alignment
description: Formal impossibility result showing single reward models fail when human preferences are diverse across subpopulations
confidence: likely
source: Chakraborty et al., MaxMin-RLHF: Alignment with Diverse Human Preferences (ICML 2024)
created: 2026-03-11
supports:
  - maxmin rlhf applies egalitarian social choice to alignment by maximizing minimum utility across preference groups
  - minority preference alignment improves 33 percent without majority compromise suggesting single reward leaves value on table
  - rlchf features based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups
reweave_edges:
  - maxmin rlhf applies egalitarian social choice to alignment by maximizing minimum utility across preference groups|supports|2026-03-28
  - minority preference alignment improves 33 percent without majority compromise suggesting single reward leaves value on table|supports|2026-03-28
  - rlchf features based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups|supports|2026-03-28
  - rlhf is implicit social choice without normative scrutiny|related|2026-03-28
related:
  - rlhf is implicit social choice without normative scrutiny

Single-reward RLHF cannot align diverse preferences because alignment gap grows proportional to minority distinctiveness and inversely to representation

Chakraborty et al. (2024) provide a formal impossibility result: when human preferences are diverse across subpopulations, a singular reward model in RLHF cannot adequately align language models. The alignment gap—the difference between optimal alignment for each group and what a single reward achieves—grows proportionally to how distinct minority preferences are and inversely to their representation in the training data.
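The scaling of the gap can be illustrated with a toy model (this is a hedged sketch, not the paper's construction): a single reward trained on pooled preferences behaves roughly like the representation-weighted average of the group rewards, so the minority's regret grows as its reward vector departs from the majority's and shrinks as its data share grows. All reward values below are illustrative.

```python
import numpy as np

# Toy sketch: two subpopulations with reward vectors over a small set of
# candidate responses. A single reward fit to pooled data acts like the
# representation-weighted average, so the minority's best response need
# no longer be the one the policy picks.

responses = ["long_positive", "short_positive", "short_neutral"]
r_majority = np.array([1.0, 0.6, 0.0])   # values sentiment, ignores length
r_minority = np.array([0.0, 1.0, 0.7])   # values conciseness

def alignment_gap(w_minority: float, distinctiveness: float) -> float:
    """Minority regret under the pooled (weighted-average) reward.

    `distinctiveness` scales how far the minority reward departs from the
    majority's; `w_minority` is the minority's share of the training data.
    """
    r_min = r_majority + distinctiveness * (r_minority - r_majority)
    pooled = (1 - w_minority) * r_majority + w_minority * r_min
    chosen = int(np.argmax(pooled))            # response the single-reward policy picks
    return float(r_min.max() - r_min[chosen])  # minority utility left on the table

# Gap grows as the minority becomes more distinct...
print([round(alignment_gap(0.1, d), 2) for d in (0.2, 0.6, 1.0)])
# ...and shrinks as its representation in the training data grows.
print([round(alignment_gap(w, 1.0), 2) for w in (0.1, 0.3, 0.5)])
```

The two printed sweeps show the qualitative dependence the theorem formalizes: monotone in distinctiveness at fixed representation, non-increasing in representation at fixed distinctiveness.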

This is demonstrated empirically at two scales:

GPT-2 scale: single-reward RLHF optimized for positive sentiment (the majority preference) while entirely ignoring conciseness (the minority preference), satisfying the majority and failing the minority outright.

Tulu2-7B scale: at a 10:1 majority:minority preference ratio, single-reward-model accuracy on the minority group dropped from 70.4% (the balanced case) to 42%. This 28.4-percentage-point degradation illustrates the structural failure mode.

The impossibility is structural, not a matter of insufficient training data or model capacity. A single reward function mathematically cannot capture context-dependent values that vary across identifiable subpopulations.
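The remedy the paper proposes (and this note's metadata records) is egalitarian: instead of fitting one pooled reward, maximize the *minimum* expected reward across groups. A minimal sketch over a discrete response set, with illustrative toy rewards and a small candidate-policy set rather than the paper's actual optimization:

```python
import numpy as np

# Sketch of the maxmin (egalitarian) objective versus pooled single-reward
# RLHF. Rewards and candidate policies are toy values for illustration.

group_rewards = np.array([
    [1.0, 0.6, 0.0],   # majority's reward per response
    [0.0, 1.0, 0.7],   # minority's reward per response
])

def expected_utilities(policy: np.ndarray) -> np.ndarray:
    """Per-group expected reward under a distribution over responses."""
    return group_rewards @ policy

# Candidate policies: each pure response, plus an even mixture.
candidates = [np.eye(3)[i] for i in range(3)] + [np.full(3, 1 / 3)]

# Single-reward RLHF with a 90/10 pool optimizes the weighted average;
# maxmin optimizes the worst-off group's utility.
pooled = max(candidates,
             key=lambda p: 0.9 * expected_utilities(p)[0] + 0.1 * expected_utilities(p)[1])
maxmin = max(candidates, key=lambda p: expected_utilities(p).min())

print("pooled picks:", pooled, "group utilities:", expected_utilities(pooled))
print("maxmin picks:", maxmin, "group utilities:", expected_utilities(maxmin))
```

In this toy setup the pooled objective picks the majority's favorite and leaves the minority with zero utility, while the maxmin objective picks a response both groups rate reasonably, mirroring the paper's claim that minority alignment can improve without much majority compromise.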

Evidence

Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignment with Diverse Human Preferences." ICML 2024. https://arxiv.org/abs/2402.08925

  • Formal proof that high subpopulation diversity leads to greater alignment gap
  • GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
  • Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio

Additional Evidence (confirm)

Source: 2025-11-00-operationalizing-pluralistic-values-llm-alignment | Added: 2026-03-15

Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.

Additional Evidence (extend)

Source: 2026-02-00-an-differentiable-social-choice | Added: 2026-03-16

An & Du's survey identifies the mechanism behind single-reward failure: RLHF performs social choice (preference aggregation) but treats it as an engineering detail rather than a normative design choice, so the aggregation function is selected implicitly, without examining which fairness criteria it satisfies.

Additional Evidence (extend)

Source: 2025-00-00-em-dpo-heterogeneous-preferences | Added: 2026-03-16

EM-DPO provides a formal proof that binary comparisons are mathematically insufficient for identifying preference types, explaining *why* single-reward RLHF fails: the training-signal format cannot carry the information needed to discover heterogeneity, regardless of dataset size. Rankings over 3+ responses are necessary.
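The insufficiency of binary signals can be seen in a two-line calculation (a toy illustration under the standard Bradley-Terry model, not EM-DPO's proof): a 50/50 mixture of two rater groups with opposite reward gaps produces exactly the same pairwise win rate as a single indifferent rater, so no amount of pairwise data distinguishes the two worlds.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Two equal-sized rater groups with opposite Bradley-Terry reward gaps
# on the same response pair (toy values).
gap_group_a = +2.0   # group A strongly prefers response 1
gap_group_b = -2.0   # group B strongly prefers response 2

# Observed pairwise win rate for response 1 under the mixture.
mixture_win_rate = 0.5 * sigmoid(gap_group_a) + 0.5 * sigmoid(gap_group_b)

# A single homogeneous rater with zero reward gap yields the same statistic,
# so binary comparisons cannot separate "everyone is indifferent" from
# "two groups with strong opposite preferences".
homogeneous_win_rate = sigmoid(0.0)

print(mixture_win_rate, homogeneous_win_rate)  # both approximately 0.5
```

Rankings over three or more responses break this symmetry: the two opposite groups each concentrate on one full ordering, a pattern no single Bradley-Terry-style model reproduces.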

Additional Evidence (confirm)

Source: 2025-11-00-sahoo-rlhf-alignment-trilemma | Added: 2026-03-16

Formal proof that preference collapse is theoretically inevitable: single-reward RLHF cannot capture multimodal preferences even in principle. The paper quantifies the practical gap: current systems use 10^3-10^4 samples from homogeneous pools while 10^7-10^8 samples are needed for global representation — a 3-4 order of magnitude shortfall that explains why minority alignment gaps grow with distinctiveness.


Relevant Notes:

Topics:

  • domains/ai-alignment/_map