teleo-codex/domains/ai-alignment/minority-preference-alignment-improves-33-percent-without-majority-compromise-suggesting-single-reward-leaves-value-on-table.md

---
type: claim
domain: ai-alignment
description: MaxMin-RLHF's 33% minority improvement without majority loss suggests single-reward approach was suboptimal for all groups
confidence: experimental
source: Chakraborty et al., MaxMin-RLHF (ICML 2024)
created: 2026-03-11
supports:
  - maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups
  - single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness
reweave_edges:
  - maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups|supports|2026-03-28
  - single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness|supports|2026-03-28
sourced_from: inbox/archive/ai-alignment/2024-02-00-chakraborty-maxmin-rlhf.md
---

Minority preference alignment improves 33% without majority compromise suggesting single-reward RLHF leaves value on table for all groups

The most surprising result from MaxMin-RLHF is not just that it helps minority groups, but that it does so WITHOUT degrading majority performance. At Tulu2-7B scale with 10:1 preference ratio:

  • Single-reward RLHF: 70.4% majority win rate, 42% minority win rate
  • MaxMin-RLHF: 56.67% win rate for BOTH groups

The minority group improved by roughly a third in relative terms (from 42% to 56.67%). The majority group decreased (from 70.4% to 56.67%), so this is not a Pareto improvement in the strict sense; it is an improvement under the egalitarian (maximin) criterion: the worst-off group improved substantially while the best-off group remained well above random.
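A quick arithmetic check on the numbers quoted above (all figures are the paper's; the ~33% headline corresponds to roughly a +35% relative change on this calculation):

```python
# Win rates reported for Tulu2-7B at a 10:1 majority:minority preference ratio.
single_reward = {"majority": 70.4, "minority": 42.0}
maxmin = {"majority": 56.67, "minority": 56.67}

# Relative change for each group when moving to MaxMin-RLHF.
for group in ("majority", "minority"):
    before, after = single_reward[group], maxmin[group]
    rel = (after - before) / before * 100
    print(f"{group}: {before}% -> {after}% ({rel:+.1f}% relative)")
# -> majority: -19.5% relative, minority: +34.9% relative

# The egalitarian objective is the worst-off group's win rate.
worst_before = min(single_reward.values())
worst_after = min(maxmin.values())
print(f"worst-off group: {worst_before}% -> {worst_after}%")
```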

This suggests the single-reward approach was not making an optimal tradeoff; it was leaving value on the table. The model was overfitting to the majority-preference signal in the training data, which is not the same thing as maximizing majority utility.

Interpretation: Single-reward RLHF may be optimizing for training-data-representation rather than actual preference satisfaction. When forced to satisfy both groups (MaxMin constraint), the model finds solutions that generalize better.
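The contrast between optimizing the pooled training signal and optimizing the worst-off group can be sketched with a toy example (hypothetical candidate names and scores, not the paper's algorithm):

```python
# Per-group reward for three hypothetical candidate responses.
candidates = {
    "majority_styled": {"majority": 0.9, "minority": 0.2},
    "minority_styled": {"majority": 0.3, "minority": 0.9},
    "balanced":        {"majority": 0.7, "minority": 0.7},
}

# Single-reward RLHF fits one reward model to pooled comparisons,
# so with a 10:1 group ratio the signal is majority-dominated.
weights = {"majority": 10 / 11, "minority": 1 / 11}

def pooled_reward(scores):
    return sum(weights[g] * s for g, s in scores.items())

# The egalitarian (maximin) objective scores a response by its
# worst per-group reward instead of the data-weighted average.
def maximin_reward(scores):
    return min(scores.values())

single_pick = max(candidates, key=lambda c: pooled_reward(candidates[c]))
maxmin_pick = max(candidates, key=lambda c: maximin_reward(candidates[c]))
print(single_pick)  # majority_styled
print(maxmin_pick)  # balanced
```

The pooled objective picks the majority-styled response (pooled score ≈ 0.84 vs 0.70 for the balanced one), while the maximin objective picks the balanced response; in this toy setup the balanced option was available all along, which is the "value left on the table" intuition.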

Caveat: This is one study at one scale with one preference split (sentiment vs conciseness). The result needs replication across different preference types, model scales, and group ratios. But the direction is striking: pluralistic alignment may not be a zero-sum tradeoff.

Evidence

Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICML 2024.

  • Tulu2-7B, 10:1 preference ratio
  • Single reward: 70.4% majority, 42% minority
  • MaxMin: 56.67% both groups
  • 33% minority improvement (42% → 56.67%)
  • Majority remains well above random despite slight decrease

Relevant Notes:

Topics:

  • domains/ai-alignment/_map