teleo-codex/domains/ai-alignment/minority-preference-alignment-improves-33-percent-without-majority-compromise-suggesting-single-reward-leaves-value-on-table.md

---
type: claim
domain: ai-alignment
description: MaxMin-RLHF's 33% minority improvement without majority loss suggests single-reward approach was suboptimal for all groups
confidence: experimental
source: Chakraborty et al., MaxMin-RLHF (ICML 2024)
created: 2026-03-11
supports:
  - maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups
  - single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness
reweave_edges:
  - maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups|supports|2026-03-28
  - single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness|supports|2026-03-28
sourced_from: inbox/archive/ai-alignment/2024-02-00-chakraborty-maxmin-rlhf.md
---

Minority preference alignment improves 33% without majority compromise suggesting single-reward RLHF leaves value on table for all groups

The most surprising result from MaxMin-RLHF is not just that it helps minority groups, but that it does so WITHOUT degrading majority performance. At Tulu2-7B scale with 10:1 preference ratio:

  • Single-reward RLHF: 70.4% majority win rate, 42% minority win rate
  • MaxMin-RLHF: 56.67% win rate for BOTH groups

The minority group improved by roughly a third in relative terms (from 42% to 56.67%). The majority group decreased (from 70.4% to 56.67%), so this is not a Pareto improvement in the strict sense; it is an improvement under the egalitarian (maximin) criterion: the worst-off group improved substantially while the best-off group remained well above random.
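A quick arithmetic check on the numbers quoted above (all figures are the paper's; the ~33% headline corresponds to roughly a +35% relative change on this calculation):

```python
# Win rates reported for Tulu2-7B at a 10:1 majority:minority preference ratio.
single_reward = {"majority": 70.4, "minority": 42.0}
maxmin = {"majority": 56.67, "minority": 56.67}

# Relative change for each group when moving to MaxMin-RLHF.
for group in ("majority", "minority"):
    before, after = single_reward[group], maxmin[group]
    rel = (after - before) / before * 100
    print(f"{group}: {before}% -> {after}% ({rel:+.1f}% relative)")
# -> majority: -19.5% relative, minority: +34.9% relative

# The egalitarian objective is the worst-off group's win rate.
worst_before = min(single_reward.values())
worst_after = min(maxmin.values())
print(f"worst-off group: {worst_before}% -> {worst_after}%")
```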

This suggests the single-reward approach was not making an optimal tradeoff; it was leaving value on the table. The model was overfitting to the majority-preference signal in the training data, which is not the same thing as maximizing majority utility.

Interpretation: Single-reward RLHF may be optimizing for training-data-representation rather than actual preference satisfaction. When forced to satisfy both groups (MaxMin constraint), the model finds solutions that generalize better.
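The contrast between optimizing the pooled training signal and optimizing the worst-off group can be sketched with a toy example (hypothetical candidate names and scores, not the paper's algorithm):

```python
# Per-group reward for three hypothetical candidate responses.
candidates = {
    "majority_styled": {"majority": 0.9, "minority": 0.2},
    "minority_styled": {"majority": 0.3, "minority": 0.9},
    "balanced":        {"majority": 0.7, "minority": 0.7},
}

# Single-reward RLHF fits one reward model to pooled comparisons,
# so with a 10:1 group ratio the signal is majority-dominated.
weights = {"majority": 10 / 11, "minority": 1 / 11}

def pooled_reward(scores):
    return sum(weights[g] * s for g, s in scores.items())

# The egalitarian (maximin) objective scores a response by its
# worst per-group reward instead of the data-weighted average.
def maximin_reward(scores):
    return min(scores.values())

single_pick = max(candidates, key=lambda c: pooled_reward(candidates[c]))
maxmin_pick = max(candidates, key=lambda c: maximin_reward(candidates[c]))
print(single_pick)  # majority_styled
print(maxmin_pick)  # balanced
```

The pooled objective picks the majority-styled response (pooled score ≈ 0.84 vs 0.70 for the balanced one), while the maximin objective picks the balanced response; in this toy setup the balanced option was available all along, which is the "value left on the table" intuition.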

Caveat: This is one study at one scale with one preference split (sentiment vs conciseness). The result needs replication across different preference types, model scales, and group ratios. But the direction is striking: pluralistic alignment may not be a zero-sum tradeoff.

Evidence

Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICML 2024.

  • Tulu2-7B, 10:1 preference ratio
  • Single reward: 70.4% majority, 42% minority
  • MaxMin: 56.67% both groups
  • 33% minority improvement (42% → 56.67%)
  • Majority remains well above random despite slight decrease

Relevant Notes:

Topics:

  • domains/ai-alignment/_map