teleo-codex/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md
Teleo Agents 74975eb326 extract: 2025-00-00-em-dpo-heterogeneous-preferences
Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
2026-03-16 15:08:47 +00:00


type: claim
domain: ai-alignment
description: MaxMin-RLHF adapts Sen's Egalitarian principle to AI alignment through mixture-of-rewards and maxmin optimization
confidence: experimental
source: Chakraborty et al., MaxMin-RLHF (ICML 2024)
created: 2026-03-11
secondary_domains: collective-intelligence

MaxMin-RLHF applies egalitarian social choice to alignment by maximizing minimum utility across preference groups rather than averaging preferences

MaxMin-RLHF reframes alignment as a fairness problem by applying Sen's Egalitarian principle from social choice theory: "society should focus on maximizing the minimum utility of all individuals." Instead of aggregating diverse preferences into a single reward function (which the authors prove impossible), MaxMin-RLHF learns a mixture of reward models and optimizes for the worst-off group.
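The contrast can be written out formally (the notation here is a reconstruction for illustration, not copied from the paper):

```latex
% Single-reward RLHF: optimize one aggregated reward
% (shown impossible to do faithfully for heterogeneous preferences):
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big]

% MaxMin-RLHF: optimize the worst-off of the discovered groups g = 1, \dots, K:
\max_{\pi} \; \min_{g \in \{1, \dots, K\}} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[\, r_g(x, y) \,\big]
```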

The mechanism has two components:

  1. EM Algorithm for Reward Mixture: Iteratively clusters humans based on preference compatibility and updates subpopulation-specific reward functions until convergence. This discovers latent preference groups from preference data.

  2. MaxMin Objective: During policy optimization, maximize the minimum utility across all discovered preference groups. This ensures no group is systematically ignored.

Empirical results:

  • Tulu2-7B scale: MaxMin maintained a 56.67% win rate across both majority and minority groups, whereas single-reward RLHF achieved 70.4% on the majority group but only 42% on the minority (at a 10:1 majority-to-minority ratio)
  • Average improvement of ~16% across groups, with a ~33% boost specifically for minority groups
  • Critically, the minority improvement came without compromising majority performance

Limitations: Assumes discrete, identifiable subpopulations. Requires specifying number of clusters beforehand. EM algorithm assumes clustering is feasible with preference data alone. Does not address continuous preference distributions or cases where individuals have context-dependent preferences.

This is the first constructive mechanism that formally addresses single-reward impossibility while staying within the RLHF framework and demonstrating empirical gains.

Evidence

Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICML 2024.

  • Draws from Sen's Egalitarian rule in social choice theory
  • EM algorithm learns mixture of reward models by clustering preference-compatible humans
  • MaxMin objective: max(min utility across groups)
  • Tulu2-7B: 56.67% win rate across both groups vs 42% minority/70.4% majority for single reward
  • 33% improvement for minority groups without majority compromise

Additional Evidence (extend)

Source: 2025-00-00-em-dpo-heterogeneous-preferences | Added: 2026-03-16

MMRA extends maxmin RLHF to the deployment phase: when the user's preference group is unknown at inference time, it minimizes the maximum regret across preference groups. Egalitarian principles can thus govern both training and inference in pluralistic systems.
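The minimax-regret idea can be illustrated in a few lines (a hedged sketch; the candidate scores, data layout, and function name are invented for this example, not MMRA's actual procedure). A response's regret for a group is the gap between that group's best achievable reward and the reward it assigns to the response; the chosen response minimizes the worst such gap.

```python
# Minimax-regret response selection when the user's preference group is
# unknown at inference. rewards[g] lists the (hypothetical) reward that
# group g's reward model assigns to each candidate response.
def minimax_regret_choice(rewards):
    n_candidates = len(next(iter(rewards.values())))
    best = {g: max(rs) for g, rs in rewards.items()}  # per-group optimum
    def worst_regret(y):
        return max(best[g] - rs[y] for g, rs in rewards.items())
    return min(range(n_candidates), key=worst_regret)

# Two groups with opposed tastes over three candidate responses:
# each group's favorite imposes maximal regret on the other group,
# so the middle candidate wins under minimax regret.
rewards = {"g0": [1.0, 0.6, 0.0], "g1": [0.0, 0.6, 1.0]}
choice = minimax_regret_choice(rewards)
```

This mirrors the training-time maxmin objective: both rules protect the worst-off group, one over reward during optimization, the other over regret during deployment.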


Relevant Notes:

Topics:

  • domains/ai-alignment/_map
  • foundations/collective-intelligence/_map