| type | domain | description | confidence | source | created | secondary_domains |
|---|---|---|---|---|---|---|
| claim | ai-alignment | MaxMin-RLHF adapts Sen's Egalitarian principle to AI alignment through mixture-of-rewards and maxmin optimization | experimental | Chakraborty et al., MaxMin-RLHF (ICML 2024) | 2026-03-11 | |
MaxMin-RLHF applies egalitarian social choice to alignment by maximizing minimum utility across preference groups rather than averaging preferences
MaxMin-RLHF reframes alignment as a fairness problem by applying Sen's Egalitarian principle from social choice theory: "society should focus on maximizing the minimum utility of all individuals." Instead of aggregating diverse preferences into a single reward function (which the authors prove impossible), MaxMin-RLHF learns a mixture of reward models and optimizes for the worst-off group.
The mechanism has two components:
- EM Algorithm for Reward Mixture: iteratively clusters humans by preference compatibility and updates subpopulation-specific reward functions until convergence, discovering latent preference groups from preference data alone.
- MaxMin Objective: during policy optimization, maximize the minimum utility across all discovered preference groups, ensuring no group is systematically ignored.
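The two components can be sketched together: an EM loop that alternates annotator responsibilities with group-reward updates, followed by an egalitarian (maxmin) pick over candidates. This is a toy illustration under assumed linear Bradley-Terry group rewards, not the paper's implementation; the function names, synthetic items, and preference data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def em_reward_mixture(prefs, X, K, n_iters=50, lr=0.5, seed=0):
    """Fit a K-component mixture of linear Bradley-Terry reward models.

    prefs: per-annotator lists of (winner_idx, loser_idx) pairs over items X.
    Returns group weight vectors W (K x d) and responsibilities gamma (n x K).
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(K, X.shape[1]))
    pi = np.full(K, 1.0 / K)
    n = len(prefs)
    for _ in range(n_iters):
        # E-step: responsibility of each group for each annotator's preferences
        logp = np.zeros((n, K))
        for i, pairs in enumerate(prefs):
            for a, b in pairs:
                diff = X[a] - X[b]                  # winner minus loser features
                logp[i] += np.log(sigmoid(W @ diff) + 1e-12)
        logp += np.log(pi + 1e-12)
        logp -= logp.max(axis=1, keepdims=True)     # stabilize before exp
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted gradient ascent on the log-likelihood
        grad = np.zeros_like(W)
        for i, pairs in enumerate(prefs):
            for a, b in pairs:
                diff = X[a] - X[b]
                p = sigmoid(W @ diff)               # shape (K,)
                grad += gamma[i][:, None] * (1.0 - p)[:, None] * diff
        W += lr * grad / n
        pi = gamma.mean(axis=0)
    return W, gamma

def maxmin_choice(candidates_X, W):
    """Pick the candidate maximizing the minimum group reward (egalitarian rule)."""
    rewards = candidates_X @ W.T                    # (n_candidates, K)
    return int(np.argmax(rewards.min(axis=1)))

# Synthetic setup (hypothetical): 3 items, two latent groups with opposed tastes.
X = np.array([[1.0, 0.0],    # item 0: favoured by group A
              [0.0, 1.0],    # item 1: favoured by group B
              [0.6, 0.6]])   # item 2: balanced compromise
prefs = [[(0, 1)] * 5 for _ in range(5)] + [[(1, 0)] * 5 for _ in range(5)]
W, gamma = em_reward_mixture(prefs, X, K=2)
best = maxmin_choice(X, W)   # egalitarian pick over the three items
```

Note how the compromise item wins the maxmin selection even though neither group ranks it first; an averaged single reward would instead favor whichever group dominates the data.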
Empirical results:
- Tulu2-7B scale: MaxMin-RLHF maintained a 56.67% win rate across both majority and minority groups, while single-reward RLHF achieved 70.4% on the majority group but only 42% on the minority group (at a 10:1 majority-to-minority ratio)
- Average improvement of ~16% across groups, with ~33% boost specifically for minority groups
- Critically: minority improvement came WITHOUT compromising majority performance
Limitations: assumes discrete, identifiable subpopulations; requires specifying the number of clusters in advance; the EM algorithm presumes that clustering is feasible from preference data alone; and the method does not address continuous preference distributions or individuals whose preferences are context-dependent.
This is the first constructive mechanism that formally addresses single-reward impossibility while staying within the RLHF framework and demonstrating empirical gains.
Evidence
Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICML 2024.
- Draws from Sen's Egalitarian rule in social choice theory
- EM algorithm learns mixture of reward models by clustering preference-compatible humans
- MaxMin objective: max(min utility across groups)
- Tulu2-7B: 56.67% win rate across both groups vs 42% minority/70.4% majority for single reward
- 33% improvement for minority groups without majority compromise
Additional Evidence (extend)
Source: 2025-00-00-em-dpo-heterogeneous-preferences | Added: 2026-03-16
MMRA extends maxmin RLHF to the deployment phase by minimizing maximum regret across preference groups when user type is unknown at inference, showing how egalitarian principles can govern both training and inference in pluralistic systems.
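The deployment-time idea can be sketched as minimax-regret selection when the user's group is unknown at inference. This is a minimal illustration under assumed linear group rewards, not MMRA's actual formulation; the function name and reward matrix are hypothetical.

```python
import numpy as np

# Hypothetical linear group rewards over 2 response features:
# group A cares only about feature 0, group B only about feature 1.
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def minimax_regret_choice(candidates_X, W):
    """Pick the candidate minimizing the worst-case regret across groups.

    A group's regret for a candidate is the gap between its best achievable
    reward over the candidate set and its reward for that candidate.
    """
    rewards = candidates_X @ W.T             # (n_candidates, n_groups)
    regret = rewards.max(axis=0) - rewards   # per-group regret of each candidate
    return int(np.argmin(regret.max(axis=1)))

candidates = np.array([[1.0, 0.0],   # ideal for A, worst for B
                       [0.0, 1.0],   # ideal for B, worst for A
                       [0.6, 0.6]])  # compromise response
pick = minimax_regret_choice(candidates, W)   # → 2 (the compromise)
```

Whereas maxmin governs training against learned group rewards, this rule applies the same egalitarian instinct at inference: hedge toward the response no group regrets too much.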
Relevant Notes:
- pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
- collective intelligence requires diversity as a structural precondition not a moral preference
- designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm
Topics:
- domains/ai-alignment/_map
- foundations/collective-intelligence/_map