teleo-codex/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md

| type | domain | description | confidence | source | created |
| --- | --- | --- | --- | --- | --- |
| claim | ai-alignment | Formal impossibility result showing single reward models fail when human preferences are diverse across subpopulations | likely | Chakraborty et al., MaxMin-RLHF: Alignment with Diverse Human Preferences (ICML 2024) | 2026-03-11 |

# Single-reward RLHF cannot align diverse preferences because alignment gap grows proportional to minority distinctiveness and inversely to representation

Chakraborty et al. (2024) provide a formal impossibility result: when human preferences are diverse across subpopulations, a single reward model in RLHF cannot adequately align language models. The alignment gap—the difference between optimal alignment for each group and what a single reward achieves—grows proportionally to how distinct minority preferences are and inversely to their representation in the training data.

This is demonstrated empirically at two scales:

GPT-2 scale: Single RLHF optimized for positive sentiment (majority preference) while completely ignoring conciseness (minority preference). The model satisfied the majority but failed the minority entirely.

Tulu2-7B scale: When the preference ratio was 10:1 (majority:minority), single reward model accuracy on minority groups dropped from 70.4% (balanced case) to 42%. This 28.4-percentage-point degradation shows the structural failure mode.

The impossibility is structural, not a matter of insufficient training data or model capacity. A single reward function mathematically cannot capture context-dependent values that vary across identifiable subpopulations.
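The shape of this result can be sketched numerically. The toy model below is my own construction, not the paper's formal setup: each group assigns rewards to a pool of candidate responses, the single reward is approximated as the representation-weighted mixture of group rewards, and the minority's alignment gap is the reward it loses when a policy greedily maximizes the pooled reward.

```python
import numpy as np

def minority_alignment_gap(minority_weight, distinctiveness, n_responses=200, seed=0):
    """Reward the minority loses when a policy greedily maximizes the pooled reward."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=n_responses)    # preference component both groups share
    r_majority = shared + 0.1 * rng.normal(size=n_responses)
    # the minority's reward diverges from the shared component by `distinctiveness`
    r_minority = shared + distinctiveness * rng.normal(size=n_responses)
    # single reward model ~ representation-weighted mixture of group rewards
    pooled = (1 - minority_weight) * r_majority + minority_weight * r_minority
    chosen = int(np.argmax(pooled))          # the response a single-reward policy selects
    return float(r_minority.max() - r_minority[chosen])

def mean_gap(weight, distinctiveness, trials=100):
    return float(np.mean([minority_alignment_gap(weight, distinctiveness, seed=s)
                          for s in range(trials)]))

# The gap grows with distinctiveness and shrinks with representation,
# matching the shape of the theorem's bound.
print(mean_gap(0.5, 0.5) < mean_gap(0.1, 2.0))   # → True
```

The specific reward construction is arbitrary; the point is only that the qualitative dependence on distinctiveness and representation falls out of any weighted-average aggregation.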

## Evidence

Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignment with Diverse Human Preferences." ICML 2024. https://arxiv.org/abs/2402.08925

  • Formal proof that high subpopulation diversity leads to greater alignment gap
  • GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
  • Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio

## Additional Evidence (confirm)

Source: 2025-11-00-operationalizing-pluralistic-values-llm-alignment | Added: 2026-03-15

Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.

## Additional Evidence (extend)

Source: 2026-02-00-an-differentiable-social-choice | Added: 2026-03-16

An & Du's survey reveals the mechanism behind single-reward failure: RLHF is doing social choice (preference aggregation) but treating it as an engineering detail rather than a normative design choice, which means the aggregation function is chosen implicitly and without examination of which fairness criteria it satisfies.
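A back-of-envelope calculation (my own illustration, not from the survey) makes the implicit aggregation concrete: pooling comparisons before fitting a Bradley-Terry reward applies a representation-weighted average to the groups' preference probabilities, so a 10% group with opposed preferences is simply outvoted.

```python
import math

# Two annotator groups with opposed preferences over responses x and y.
p_a, p_b = 0.9, 0.1   # P(x preferred over y) within group A and group B
w_a, w_b = 0.9, 0.1   # each group's share of the pooled dataset

# The comparison statistics a single reward model is fit to:
p_pooled = w_a * p_a + w_b * p_b          # 0.82

# A Bradley-Terry reward fit to this data recovers the margin that
# reproduces p_pooled: r(x) - r(y) = logit(p_pooled) > 0, so the model
# confidently prefers x and group B's preference leaves no trace.
margin = math.log(p_pooled / (1 - p_pooled))
print(round(p_pooled, 2), round(margin, 2))   # → 0.82 1.52
```

The aggregation rule here (weighted averaging of preference probabilities) was never chosen deliberately; it is a byproduct of pooling the data, which is exactly the survey's point.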

## Additional Evidence (extend)

Source: 2025-00-00-em-dpo-heterogeneous-preferences | Added: 2026-03-16

EM-DPO provides formal proof that binary comparisons are mathematically insufficient for preference type identification, explaining why single-reward RLHF fails: the training signal format cannot contain the information needed to discover heterogeneity, regardless of dataset size. Rankings over 3+ responses are necessary.
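The insufficiency of binary comparisons can be seen in a three-item toy example (my own construction, illustrating the identifiability issue rather than reproducing the paper's proof): two different mixtures of annotator types induce exactly the same pairwise-comparison statistics, so no amount of binary data can tell them apart.

```python
from fractions import Fraction
from itertools import permutations

ITEMS = "abc"

def pairwise_marginals(ranking_dist):
    """P(i preferred over j) for every ordered pair, given a distribution over rankings."""
    return {
        (i, j): sum(p for r, p in ranking_dist.items() if r.index(i) < r.index(j))
        for i in ITEMS for j in ITEMS if i != j
    }

# Mixture 1: two sharply opposed preference types, 50/50.
mix1 = {"abc": Fraction(1, 2), "cba": Fraction(1, 2)}
# Mixture 2: annotators whose rankings are uniformly random.
mix2 = {"".join(r): Fraction(1, 6) for r in permutations(ITEMS)}

# Binary comparisons cannot distinguish the mixtures...
print(pairwise_marginals(mix1) == pairwise_marginals(mix2))   # → True
# ...but full rankings over the 3 items can.
print(mix1 == mix2)                                           # → False
```

Both mixtures give P(i ≻ j) = 1/2 for every pair, yet one has two coherent subpopulations and the other none, which is the heterogeneity that rankings over 3+ responses recover.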

## Additional Evidence (confirm)

Source: 2025-11-00-sahoo-rlhf-alignment-trilemma | Added: 2026-03-16

Formal proof that preference collapse is theoretically inevitable: single-reward RLHF cannot capture multimodal preferences even in principle. The paper quantifies the practical gap: current systems use 10^3-10^4 samples from homogeneous pools while 10^7-10^8 samples are needed for global representation — a 3-4 order of magnitude shortfall that explains why minority alignment gaps grow with distinctiveness.


## Relevant Notes

## Topics

  • domains/ai-alignment/_map