teleo-codex/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md
Teleo Agents 74975eb326 extract: 2025-00-00-em-dpo-heterogeneous-preferences
Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
2026-03-16 15:08:47 +00:00


type: claim
domain: ai-alignment
description: MaxMin-RLHF adapts Sen's Egalitarian principle to AI alignment through mixture-of-rewards and maxmin optimization
confidence: experimental
source: Chakraborty et al., MaxMin-RLHF (ICML 2024)
created: 2026-03-11
secondary_domains: collective-intelligence

MaxMin-RLHF applies egalitarian social choice to alignment by maximizing minimum utility across preference groups rather than averaging preferences

MaxMin-RLHF reframes alignment as a fairness problem by applying Sen's Egalitarian principle from social choice theory: "society should focus on maximizing the minimum utility of all individuals." Instead of aggregating diverse preferences into a single reward function (which the authors prove impossible), MaxMin-RLHF learns a mixture of reward models and optimizes for the worst-off group.
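The contrast can be written out formally (the notation here is a reconstruction for illustration, not copied from the paper):

```latex
% Single-reward RLHF: optimize one aggregated reward
% (shown impossible to do faithfully for heterogeneous preferences):
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big]

% MaxMin-RLHF: optimize the worst-off of the discovered groups g = 1, \dots, K:
\max_{\pi} \; \min_{g \in \{1, \dots, K\}} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[\, r_g(x, y) \,\big]
```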

The mechanism has two components:

  1. EM Algorithm for Reward Mixture: Iteratively clusters humans based on preference compatibility and updates subpopulation-specific reward functions until convergence. This discovers latent preference groups from preference data.

  2. MaxMin Objective: During policy optimization, maximize the minimum utility across all discovered preference groups. This ensures no group is systematically ignored.

Empirical results:

  • Tulu2-7B scale: MaxMin maintained a 56.67% win rate across both majority and minority groups, whereas single-reward RLHF achieved 70.4% on the majority group but only 42% on the minority (at a 10:1 majority-to-minority ratio)
  • Average improvement of ~16% across groups, with a ~33% boost specifically for minority groups
  • Critically, the minority improvement came without compromising majority performance

Limitations: Assumes discrete, identifiable subpopulations. Requires specifying number of clusters beforehand. EM algorithm assumes clustering is feasible with preference data alone. Does not address continuous preference distributions or cases where individuals have context-dependent preferences.

This is the first constructive mechanism that formally addresses single-reward impossibility while staying within the RLHF framework and demonstrating empirical gains.

Evidence

Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICML 2024.

  • Draws from Sen's Egalitarian rule in social choice theory
  • EM algorithm learns mixture of reward models by clustering preference-compatible humans
  • MaxMin objective: max(min utility across groups)
  • Tulu2-7B: 56.67% win rate across both groups vs 42% minority/70.4% majority for single reward
  • 33% improvement for minority groups without majority compromise

Additional Evidence (extend)

Source: 2025-00-00-em-dpo-heterogeneous-preferences | Added: 2026-03-16

MMRA extends maxmin RLHF to the deployment phase: when the user's preference group is unknown at inference time, it minimizes the maximum regret across preference groups. Egalitarian principles can thus govern both training and inference in pluralistic systems.
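The minimax-regret idea can be illustrated in a few lines (a hedged sketch; the candidate scores, data layout, and function name are invented for this example, not MMRA's actual procedure). A response's regret for a group is the gap between that group's best achievable reward and the reward it assigns to the response; the chosen response minimizes the worst such gap.

```python
# Minimax-regret response selection when the user's preference group is
# unknown at inference. rewards[g] lists the (hypothetical) reward that
# group g's reward model assigns to each candidate response.
def minimax_regret_choice(rewards):
    n_candidates = len(next(iter(rewards.values())))
    best = {g: max(rs) for g, rs in rewards.items()}  # per-group optimum
    def worst_regret(y):
        return max(best[g] - rs[y] for g, rs in rewards.items())
    return min(range(n_candidates), key=worst_regret)

# Two groups with opposed tastes over three candidate responses:
# each group's favorite imposes maximal regret on the other group,
# so the middle candidate wins under minimax regret.
rewards = {"g0": [1.0, 0.6, 0.0], "g1": [0.0, 0.6, 1.0]}
choice = minimax_regret_choice(rewards)
```

This mirrors the training-time maxmin objective: both rules protect the worst-off group, one over reward during optimization, the other over regret during deployment.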


Relevant Notes:

Topics:

  • domains/ai-alignment/_map
  • foundations/collective-intelligence/_map