
---
type: claim
domain: ai-alignment
description: Formal impossibility result showing single reward models fail when human preferences are diverse across subpopulations
confidence: likely
source: "Chakraborty et al., MaxMin-RLHF: Alignment with Diverse Human Preferences (ICML 2024)"
created: 2026-03-11
supports:
  - maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups
  - minority-preference-alignment-improves-33-percent-without-majority-compromise-suggesting-single-reward-leaves-value-on-table
  - rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups
reweave_edges:
  - maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups|supports|2026-03-28
  - minority-preference-alignment-improves-33-percent-without-majority-compromise-suggesting-single-reward-leaves-value-on-table|supports|2026-03-28
  - rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups|supports|2026-03-28
  - rlhf-is-implicit-social-choice-without-normative-scrutiny|related|2026-03-28
  - RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced|related|2026-04-17
related:
  - rlhf-is-implicit-social-choice-without-normative-scrutiny
  - RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced
sourced_from: inbox/archive/ai-alignment/2024-02-00-chakraborty-maxmin-rlhf.md
---

Single-reward RLHF cannot align diverse preferences because alignment gap grows proportional to minority distinctiveness and inversely to representation

Chakraborty et al. (2024) prove a formal impossibility result: when human preferences are diverse across subpopulations, a single reward model in RLHF cannot adequately align a language model. The alignment gap (the difference between the best achievable alignment for each group and what a single reward achieves) grows in proportion to how distinct the minority's preferences are, and inversely with the minority's representation in the training data.

This is demonstrated empirically at two scales:

GPT-2 scale: single-reward RLHF optimized for positive sentiment (the majority preference) while entirely ignoring conciseness (the minority preference). The model satisfied the majority but failed the minority outright.

Tulu2-7B scale: when the majority-to-minority preference ratio was 10:1, single-reward-model accuracy on the minority group dropped from 70.4% (in the balanced case) to 42%. This 28.4-percentage-point degradation illustrates the structural failure mode.

The impossibility is structural, not a matter of insufficient training data or model capacity. A single reward function mathematically cannot capture context-dependent values that vary across identifiable subpopulations.
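The proportionality claim can be sketched numerically. The toy model below is my construction, not the paper's proof: two responses, a majority reward, a minority reward whose distinctiveness `d` interpolates away from the majority's, and a single reward taken (optimistically) as the population-weighted mixture. Each group's ideal policy is the KL-regularized optimum `softmax(r / beta)` against a uniform reference, and the minority's gap is measured in its own regularized objective, so it is always non-negative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gap(d, w, beta=1.0):
    """Minority's loss, in its own KL-regularized objective, from being
    served by a pooled single reward. Hypothetical two-response sketch."""
    r_maj = np.array([1.0, 0.0])               # majority reward over 2 responses
    r_min = (1 - d) * r_maj + d * r_maj[::-1]  # d = minority distinctiveness
    r_pool = (1 - w) * r_maj + w * r_min       # single reward, w = representation
    def J(pi):                                 # expected r_min + beta * entropy
        return pi @ r_min - beta * (pi @ np.log(pi))
    return J(softmax(r_min / beta)) - J(softmax(r_pool / beta))

print(gap(0.2, 0.1) < gap(0.8, 0.1))  # gap grows with distinctiveness d
print(gap(0.8, 0.4) < gap(0.8, 0.1))  # gap shrinks as representation w rises
```

Running the two comparisons confirms the qualitative claim: holding representation fixed, larger distinctiveness widens the gap; holding distinctiveness fixed, larger representation narrows it.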

Evidence

Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignment with Diverse Human Preferences." ICML 2024. https://arxiv.org/abs/2402.08925

  • Formal proof that high subpopulation diversity leads to greater alignment gap
  • GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
  • Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio

Additional Evidence (confirm)

Source: 2025-11-00-operationalizing-pluralistic-values-llm-alignment | Added: 2026-03-15

Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.

Additional Evidence (extend)

Source: 2026-02-00-an-differentiable-social-choice | Added: 2026-03-16

An & Du's survey identifies the mechanism behind single-reward failure: RLHF performs social choice (preference aggregation) but treats it as an engineering detail rather than a normative design choice, so the aggregation function is chosen implicitly, without examining which fairness criteria it satisfies.
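The point that aggregation is a normative choice, not a neutral detail, can be made concrete with a tiny example (hypothetical group utilities, my illustration): the same data yields different "aligned" answers under utilitarian versus maximin aggregation.

```python
# Hypothetical utilities three groups assign to two candidate responses.
utils = {"A": [0.9, 0.9, 0.1],   # response A: great for two groups, bad for one
         "B": [0.6, 0.6, 0.6]}   # response B: acceptable to everyone

mean_winner = max(utils, key=lambda k: sum(utils[k]) / 3)  # utilitarian rule
maximin_winner = max(utils, key=lambda k: min(utils[k]))   # egalitarian rule
print(mean_winner, maximin_winner)  # -> A B : same data, different winner
```

Standard reward-model training on pooled comparisons implicitly behaves like the first rule; MaxMin-RLHF makes the second rule explicit.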

Additional Evidence (extend)

Source: 2025-00-00-em-dpo-heterogeneous-preferences | Added: 2026-03-16

EM-DPO provides a formal proof that binary comparisons are mathematically insufficient to identify preference types, explaining why single-reward RLHF fails: the binary training-signal format cannot carry the information needed to discover heterogeneity, regardless of dataset size. Rankings over three or more responses are necessary.
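A minimal sketch of this non-identifiability (hypothetical numbers, not EM-DPO's construction): two different mixtures of preference types produce exactly the same binary-comparison statistic, so no amount of pairwise data over two responses can distinguish them.

```python
# A preference "type" is (mixture weight, P(type prefers response a over b)).
mix_A = [(0.5, 0.9), (0.5, 0.3)]  # two sharply divergent subpopulations
mix_B = [(0.5, 0.7), (0.5, 0.5)]  # milder heterogeneity

def binary_marginal(mix):
    # The only quantity a binary-comparison dataset can estimate.
    return sum(w * p for w, p in mix)

print(binary_marginal(mix_A), binary_marginal(mix_B))  # both equal 0.6
```

Both mixtures yield the same observable marginal (0.6), so a single reward fit to pairwise data cannot tell them apart; richer signals such as full rankings expose the difference.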

Additional Evidence (confirm)

Source: 2025-11-00-sahoo-rlhf-alignment-trilemma | Added: 2026-03-16

Formal proof that preference collapse is theoretically inevitable: single-reward RLHF cannot capture multimodal preferences even in principle. The paper quantifies the practical gap: current systems use 10^3-10^4 samples from homogeneous pools while 10^7-10^8 samples are needed for global representation — a 3-4 order of magnitude shortfall that explains why minority alignment gaps grow with distinctiveness.


Relevant Notes:

Topics:

  • domains/ai-alignment/_map