Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
| type | domain | description | confidence | source | created |
|---|---|---|---|---|---|
| claim | ai-alignment | MaxMin-RLHF's 33% minority improvement without majority loss suggests single-reward approach was suboptimal for all groups | experimental | Chakraborty et al., MaxMin-RLHF (ICML 2024) | 2026-03-11 |
Minority preference alignment improves 33% without majority compromise, suggesting single-reward RLHF leaves value on the table for all groups
The most surprising result from MaxMin-RLHF is not just that it helps minority groups, but that it does so without sacrificing majority alignment in any meaningful sense: the majority win rate stays well above chance. At Tulu2-7B scale with a 10:1 majority-to-minority preference ratio:
- Single-reward RLHF: 70.4% majority win rate, 42% minority win rate
- MaxMin-RLHF: 56.67% win rate for BOTH groups
The minority group improved by ~33% (from 42% to 56.67%). The majority group decreased (from 70.4% to 56.67%), so this is not a Pareto improvement in the strict sense, but it is an improvement under the egalitarian (maximin) welfare criterion: the worst-off group improved substantially while the best-off group remained well above random.
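The egalitarian comparison can be made concrete with the reported numbers: maximin welfare scores an outcome by the win rate of the worst-off group. A minimal sketch using the win rates from the evidence section of this note:

```python
# Reported win rates (Tulu2-7B, 10:1 preference ratio), taken from
# the evidence section of this note.
single_reward = {"majority": 0.704, "minority": 0.42}
maxmin = {"majority": 0.5667, "minority": 0.5667}

def egalitarian_welfare(win_rates):
    """Rawlsian / maximin welfare: the utility of the worst-off group."""
    return min(win_rates.values())

print(egalitarian_welfare(single_reward))  # 0.42
print(egalitarian_welfare(maxmin))         # 0.5667
```

By this criterion the MaxMin policy strictly dominates: the floor rises from 42% to 56.67%, even though the average across groups falls.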
This suggests the single-reward approach was not making an optimal tradeoff; it was leaving value on the table. The model was overfitting to majority preferences in ways that did not even maximize majority utility, only the majority-preference signal in the training data.
Interpretation: Single-reward RLHF may be optimizing for representation in the training data rather than actual preference satisfaction. When forced to satisfy both groups (the MaxMin constraint), the model finds solutions that generalize better.
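The difference between the two objectives can be sketched with toy numbers. Assuming (hypothetically, not from the paper) per-group reward estimates for three candidate policies, a pooled reward weighted by the 10:1 group ratio picks a different winner than the maximin objective:

```python
# Hypothetical per-group reward estimates for three candidate policies.
# All numbers are illustrative, not from Chakraborty et al.
policies = {
    "A": {"majority": 0.72, "minority": 0.40},
    "B": {"majority": 0.58, "minority": 0.57},
    "C": {"majority": 0.65, "minority": 0.45},
}
weights = {"majority": 10 / 11, "minority": 1 / 11}  # 10:1 preference ratio

def pooled_score(rewards):
    """What a single reward model trained on pooled data effectively optimizes:
    a representation-weighted average dominated by the majority signal."""
    return sum(weights[g] * rewards[g] for g in rewards)

def maxmin_score(rewards):
    """The MaxMin objective: score a policy by its worst-off group."""
    return min(rewards.values())

best_pooled = max(policies, key=lambda p: pooled_score(policies[p]))
best_maxmin = max(policies, key=lambda p: maxmin_score(policies[p]))
print(best_pooled, best_maxmin)  # A B
```

The pooled objective selects policy A (majority-dominated), while the maximin objective selects B, the policy with the highest floor; this is the mechanism behind "optimizing for representation rather than preference satisfaction."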
Caveat: This is one study at one scale with one preference split (sentiment vs conciseness). The result needs replication across different preference types, model scales, and group ratios. But the direction is striking: pluralistic alignment may not be a zero-sum tradeoff.
Evidence
Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICML 2024.
- Tulu2-7B, 10:1 preference ratio
- Single reward: 70.4% majority, 42% minority
- MaxMin: 56.67% both groups
- 33% minority improvement (42% → 56.67%)
- Majority remains well above random despite slight decrease
Relevant Notes:
- pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
Topics:
- domains/ai-alignment/_map