- Source: inbox/archive/2024-02-00-chakraborty-maxmin-rlhf.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 5)
- Pentagon-Agent: Theseus
- Size: 2.9 KiB
| type | domain | description | confidence | source | created | depends_on |
|---|---|---|---|---|---|---|
| claim | ai-alignment | MaxMin-RLHF achieves a 33% minority-group improvement while maintaining majority performance, suggesting single-reward RLHF leaves value on the table rather than navigating zero-sum constraints | experimental | Chakraborty et al. (2024), MaxMin-RLHF experiments at GPT-2 and Tulu2-7B scale, ICML 2024 | 2024-02-14 | |

# MaxMin alignment improves minority group performance by 33% without compromising majority outcomes
MaxMin-RLHF achieved substantial minority group improvement (33% boost at Tulu2-7B scale) while maintaining majority group performance. This suggests that single-reward RLHF was making suboptimal tradeoffs rather than navigating genuine zero-sum constraints.
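For reference, the aggregation difference can be written compactly. Notation here is assumed, not quoted from the paper: $r$ is a single reward model fit to pooled preferences, $r_g$ is group $g$'s reward model, $\mathcal{G}$ the set of preference groups, $\pi_{\mathrm{ref}}$ the reference policy, and $\beta$ the KL weight.

```latex
% Single-reward RLHF: one reward r fit to pooled preferences
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
  \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)

% MaxMin-RLHF: maximize the expected reward of the worst-off group
\max_{\pi}\; \min_{g \in \mathcal{G}}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r_g(x, y)\big]
  \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
```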
## Evidence
Tulu2-7B scale with 10:1 majority:minority ratio:
- Single-reward RLHF: 70.4% majority win rate, 42% minority win rate
- MaxMin-RLHF: ~56.67% win rate for BOTH groups
- Net result: ~16% average improvement, ~33% minority-specific boost
GPT-2 scale qualitative results:
- Single RLHF optimized for positive sentiment (majority preference) while completely ignoring conciseness (minority preference)
- MaxMin satisfied both simultaneously—not through compromise but through discovering that the constraints were compatible
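A minimal toy sketch of that point, with made-up candidate responses and reward scores (illustrative only; this is not the paper's data, reward models, or training procedure):

```python
# Toy illustration only: two preference groups score three candidate responses.
# We compare (a) a pooled reward weighted 10:1 toward the majority group, a
# stand-in for a single reward model fit to majority-dominated preference data,
# and (b) MaxMin-style aggregation, which ranks by the worst-off group's score.

candidates = {
    #  response                 (majority: sentiment, minority: conciseness)
    "long gushing praise":      (0.95, 0.10),
    "short positive reply":     (0.80, 0.85),
    "terse neutral reply":      (0.40, 0.95),
}

MAJORITY_WEIGHT, MINORITY_WEIGHT = 10, 1   # mirrors the 10:1 ratio above

def pooled_reward(scores):
    """Majority-dominated average, as a proxy for a single pooled reward model."""
    majority, minority = scores
    return (MAJORITY_WEIGHT * majority + MINORITY_WEIGHT * minority) / (MAJORITY_WEIGHT + MINORITY_WEIGHT)

def maxmin_reward(scores):
    """Egalitarian aggregation: a response is only as good as its worst group score."""
    return min(scores)

best_pooled = max(candidates, key=lambda r: pooled_reward(candidates[r]))
best_maxmin = max(candidates, key=lambda r: maxmin_reward(candidates[r]))

print("pooled-reward pick :", best_pooled)   # "long gushing praise"  (minority ignored)
print("max-min pick       :", best_maxmin)   # "short positive reply" (both groups satisfied)
```

The toy only shows that min-aggregation surfaces a mutually acceptable response when one exists, whereas a majority-weighted pooled reward never looks for it.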
## Why This Matters
The absence of majority performance degradation is the key finding. If alignment were genuinely zero-sum across preference groups, MaxMin would have to sacrifice majority utility to improve minority outcomes. Instead, it found Pareto improvements—outcomes better for some groups and no worse for others.
This suggests single-reward aggregation was destroying value through premature averaging, not making optimal tradeoffs given fundamental constraints. The implication is that preference diversity can be accommodated without sacrifice if the aggregation mechanism is chosen appropriately.
## Limitations
Results are at GPT-2 and Tulu2-7B scale. Unclear whether Pareto improvements persist at frontier model scale or with more than two preference groups. The mechanism assumes discrete, identifiable subpopulations—continuous or overlapping preferences may not exhibit the same property.
No comparison with bridging-based approaches (RLCF, Community Notes mentioned in related work). MaxMin may be one mechanism among several that avoid premature aggregation.
Relevant Notes:
- maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-group-utility
- single-reward-rlhf-cannot-align-models-with-diverse-human-preferences
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
- pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
Topics: