| type | domain | description | confidence | source | created |
|---|---|---|---|---|---|
| claim | ai-alignment | Formal impossibility result showing single reward models fail when human preferences are diverse across subpopulations | likely | Chakraborty et al., MaxMin-RLHF: Alignment with Diverse Human Preferences (ICML 2024) | 2026-03-11 |
Single-reward RLHF cannot align diverse preferences: the alignment gap grows proportionally to minority distinctiveness and inversely to minority representation
Chakraborty et al. (2024) provide a formal impossibility result: when human preferences are diverse across subpopulations, a single reward model in RLHF cannot adequately align language models. The alignment gap—the difference between the optimal alignment achievable for each group and what a single reward achieves—grows proportionally to how distinct the minority's preferences are and inversely to that minority's representation in the training data.
This is demonstrated empirically at two scales:
- GPT-2 scale: Single-reward RLHF optimized for positive sentiment (the majority preference) while completely ignoring conciseness (the minority preference). The model satisfied the majority but failed the minority entirely.
- Tulu2-7B scale: When the preference ratio was 10:1 (majority:minority), single reward model accuracy on minority groups dropped from 70.4% (balanced case) to 42%. This 28.4-percentage-point degradation shows the structural failure mode.
The impossibility is structural, not a matter of insufficient training data or model capacity. A single reward function mathematically cannot capture context-dependent values that vary across identifiable subpopulations.
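A toy numerical sketch of the structural failure (the group sizes and reward values below are illustrative, not taken from the paper): averaging two groups' rewards into a single reward model drives the optimal policy to serve only the majority, leaving the minority's alignment gap at its maximum, while a MaxMin-style objective over the same rewards finds a balanced policy.

```python
import numpy as np

# Two subpopulations with opposing preferences over two responses, A and B.
weights = np.array([0.9, 0.1])        # hypothetical majority/minority shares
rewards = np.array([[1.0, 0.0],       # group 1 prefers response A
                    [0.0, 1.0]])      # group 2 prefers response B

# Single reward model: population-weighted average of group rewards.
avg_reward = weights @ rewards        # array([0.9, 0.1])
best_single = int(avg_reward.argmax())  # deterministic policy: always answer A

# Per-group alignment gap of that policy: the minority gets reward 0,
# so its gap is maximal no matter how much training data we add.
gap = rewards.max(axis=1) - rewards[:, best_single]   # array([0., 1.])

# MaxMin-style objective: pick the (mixed) policy maximizing the
# worst-off group's expected reward. Grid search over P(answer = A).
p = np.linspace(0.0, 1.0, 1001)
exp_r = np.stack([p * rewards[g, 0] + (1 - p) * rewards[g, 1]
                  for g in range(2)])                 # expected reward per group
maxmin_p = float(p[exp_r.min(axis=0).argmax()])       # balanced policy, P(A)=0.5
```

The sketch shows the mechanism, not the paper's method: the minority's gap under the averaged reward is 1.0 and is independent of dataset size, whereas maximizing the minimum group reward yields a mixed policy (P(A)=0.5) that gives every group expected reward 0.5.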
Evidence
Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignment with Diverse Human Preferences." ICML 2024. https://arxiv.org/abs/2402.08925
- Formal proof that high subpopulation diversity leads to greater alignment gap
- GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
- Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio
Relevant Notes:
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
- pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
Topics:
- domains/ai-alignment/_map