extract: 2025-00-00-em-dpo-heterogeneous-preferences
Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
parent 73f5df250b
commit f44d100882
6 changed files with 74 additions and 1 deletions
@@ -37,6 +37,12 @@ Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICM
 - Tulu2-7B: 56.67% win rate across both groups vs 42% minority/70.4% majority for single reward
 - 33% improvement for minority groups without majority compromise

+
+### Additional Evidence (extend)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
+
+EM-DPO implements MinMax Regret Aggregation (MMRA) for ensemble deployment, which is based on the same egalitarian social choice principle (min-max fairness) but applies it at inference time across an ensemble of specialized models rather than during reward model training. This shows the egalitarian criterion can be implemented at different architectural points.
+
 ---

 Relevant Notes:

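The min-max regret criterion named in the EM-DPO note above can be illustrated with a small sketch. This is not the paper's MMRA implementation: the regret table, the two-policy restriction, the grid search, and the assumption that a mixture's regret is linear in its weights are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def max_regret(weights: List[float], regret: List[List[float]]) -> float:
    """Worst-case regret over preference types for a mixture of policies.

    regret[i][k] is the (hypothetical) regret that preference type k
    suffers under policy i. Mixture regret is ASSUMED linear in the weights.
    """
    n_types = len(regret[0])
    per_type = [
        sum(w * regret[i][k] for i, w in enumerate(weights))
        for k in range(n_types)
    ]
    return max(per_type)


def minmax_regret_two_policies(
    regret: List[List[float]], steps: int = 100
) -> Tuple[float, float]:
    """Grid-search the weight on policy 0 that minimizes worst-case regret."""
    best_w, best_val = 0.0, float("inf")
    for s in range(steps + 1):
        w = s / steps
        val = max_regret([w, 1.0 - w], regret)
        if val < best_val:
            best_w, best_val = w, val
    return best_w, best_val


# Made-up numbers: each policy is ideal for its own type (zero regret)
# and costly for the other type.
REGRET = [
    [0.0, 0.4],  # policy 0: regret for type 0, regret for type 1
    [0.6, 0.0],  # policy 1: regret for type 0, regret for type 1
]

best_weight, worst_case = minmax_regret_two_policies(REGRET)
print(f"weight on policy 0: {best_weight:.2f}, worst-case regret: {worst_case:.2f}")
```

With these numbers the search settles on the weight that equalizes the two types' regrets, which is the egalitarian behavior the note describes: neither type is left with a much worse outcome than the other.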
@@ -25,6 +25,12 @@ Since [[universal alignment is mathematically impossible because Arrows impossib

 MaxMin-RLHF provides a constructive implementation of pluralistic alignment through mixture-of-rewards and egalitarian optimization. Rather than converging preferences, it learns separate reward models for each subpopulation and optimizes for the worst-off group (Sen's Egalitarian principle). At Tulu2-7B scale, this achieved 56.67% win rate across both majority and minority groups, compared to single-reward's 70.4%/42% split. The mechanism accommodates irreducible diversity by maintaining separate reward functions rather than forcing convergence.

+
+### Additional Evidence (confirm)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
+
+EM-DPO demonstrates a concrete implementation of pluralistic alignment through ensemble models where each model serves a different preference type, combined via egalitarian aggregation. The system maintains multiple specialized models rather than forcing convergence to a single aligned state.
+
 ---

 Relevant Notes:

@ -33,6 +33,12 @@ The paper's proposed solution—RLCHF with explicit social welfare functions—c
RLCF makes the social choice mechanism explicit through the bridging algorithm (matrix factorization with intercept scores). Unlike standard RLHF which aggregates preferences opaquely through reward model training, RLCF's use of intercepts as the training signal is a deliberate choice to optimize for cross-partisan agreement—a specific social welfare function.

+
+### Additional Evidence (confirm)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
+
+EM-DPO explicitly grounds its aggregation mechanism (MMRA) in egalitarian social choice theory, demonstrating that pluralistic alignment requires making social choice principles explicit rather than leaving them implicit in the training procedure.
+
 ---

 Relevant Notes:

@@ -33,6 +33,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm

 Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.

+
+### Additional Evidence (extend)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
+
+EM-DPO provides a formal identifiability proof showing that binary comparisons (used in standard RLHF/DPO) cannot detect preference heterogeneity, while rankings over 3+ responses can. This explains the mechanism behind why single-reward approaches fail: they use a data format that is information-theoretically insufficient to distinguish preference subpopulations.
+
 ---

 Relevant Notes:

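The identifiability claim in the note above can be made concrete with a toy example. The sketch below is an illustration only, not the paper's proof: the two preference types, their utilities, the 50/50 mixture, and the choice of Bradley-Terry and Plackett-Luce models are all assumptions introduced here. It compares a mixed population against a homogeneous one with flat utilities.

```python
import math
from itertools import combinations, permutations
from typing import Dict, Tuple

ITEMS = ("a", "b", "c")


def pairwise_prob(u: Dict[str, float], x: str, y: str) -> float:
    """Bradley-Terry probability that x is preferred to y."""
    return 1.0 / (1.0 + math.exp(u[y] - u[x]))


def ranking_prob(u: Dict[str, float], ranking: Tuple[str, ...]) -> float:
    """Plackett-Luce probability of a full ranking (best item first)."""
    prob = 1.0
    remaining = list(ranking)
    while remaining:
        total = sum(math.exp(u[i]) for i in remaining)
        prob *= math.exp(u[remaining[0]]) / total
        remaining.pop(0)
    return prob


# Hypothetical population: two equally sized types with opposite tastes,
# compared against a homogeneous population that is indifferent.
TYPE_A = {"a": 2.0, "b": 0.0, "c": -2.0}
TYPE_B = {"a": -2.0, "b": 0.0, "c": 2.0}
FLAT = {"a": 0.0, "b": 0.0, "c": 0.0}

# Binary comparisons: the mixture and the flat population look the same.
mixture_pairs = {
    (x, y): 0.5 * pairwise_prob(TYPE_A, x, y) + 0.5 * pairwise_prob(TYPE_B, x, y)
    for x, y in combinations(ITEMS, 2)
}
flat_pairs = {(x, y): pairwise_prob(FLAT, x, y) for x, y in combinations(ITEMS, 2)}

# Rankings over three responses: the two populations differ.
mixture_rankings = {
    r: 0.5 * ranking_prob(TYPE_A, r) + 0.5 * ranking_prob(TYPE_B, r)
    for r in permutations(ITEMS)
}
flat_rankings = {r: ranking_prob(FLAT, r) for r in permutations(ITEMS)}

for pair in mixture_pairs:
    print(pair, round(mixture_pairs[pair], 3), round(flat_pairs[pair], 3))
for r in mixture_rankings:
    print(r, round(mixture_rankings[r], 3), round(flat_rankings[r], 3))
```

In this toy setup every pairwise win rate is one half in both populations, so binary data cannot tell them apart, while the ranking distribution of the mixture concentrates on the two opposite orderings instead of being uniform.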
@@ -0,0 +1,38 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "binary-preference-comparisons-cannot-identify-latent-preference-types-requiring-rankings-over-three-plus-responses.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "em-algorithm-discovers-latent-preference-subpopulations-enabling-ensemble-alignment-without-demographic-labels.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 8,
+    "rejected": 2,
+    "fixes_applied": [
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-requiring-rankings-over-three-plus-responses.md:set_created:2026-03-16",
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-requiring-rankings-over-three-plus-responses.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-requiring-rankings-over-three-plus-responses.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md",
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-requiring-rankings-over-three-plus-responses.md:stripped_wiki_link:modeling preference sensitivity as a learned distribution ra",
+      "em-algorithm-discovers-latent-preference-subpopulations-enabling-ensemble-alignment-without-demographic-labels.md:set_created:2026-03-16",
+      "em-algorithm-discovers-latent-preference-subpopulations-enabling-ensemble-alignment-without-demographic-labels.md:stripped_wiki_link:pluralistic alignment must accommodate irreducibly diverse v",
+      "em-algorithm-discovers-latent-preference-subpopulations-enabling-ensemble-alignment-without-demographic-labels.md:stripped_wiki_link:maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-b",
+      "em-algorithm-discovers-latent-preference-subpopulations-enabling-ensemble-alignment-without-demographic-labels.md:stripped_wiki_link:minority-preference-alignment-improves-33-percent-without-ma"
+    ],
+    "rejections": [
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-requiring-rankings-over-three-plus-responses.md:missing_attribution_extractor",
+      "em-algorithm-discovers-latent-preference-subpopulations-enabling-ensemble-alignment-without-demographic-labels.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-16"
+}

@@ -7,9 +7,13 @@ date: 2025-01-01
 domain: ai-alignment
 secondary_domains: []
 format: paper
-status: unprocessed
+status: enrichment
 priority: medium
 tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness]
+processed_by: theseus
+processed_date: 2026-03-16
+enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content

@@ -39,3 +43,10 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user prefe
 PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
 WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches
 EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination
+
+
+## Key Facts
+- EM-DPO paper presented at EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization)
+- MinMax Regret Aggregation (MMRA) is based on min-max regret fairness criterion from egalitarian social choice theory
+- EM-DPO evaluation used benchmark tasks but did not include head-to-head comparison with PAL or MixDPO
+- The algorithm works within Arrow's framework by applying a specific social choice principle (egalitarian min-max)

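The EM half of the "EM + egalitarian aggregation" combination mentioned in the extraction hint can be sketched on toy data. This is a deliberately reduced illustration and not the paper's procedure: the per-type utilities are fixed, hypothetical Plackett-Luce models, the ten rankings are made up, and only the mixing weights and per-annotator responsibilities are estimated. No per-type policy is trained here.

```python
import math
from typing import Dict, List, Sequence, Tuple

Ranking = Tuple[str, ...]


def ranking_prob(u: Dict[str, float], ranking: Ranking) -> float:
    """Plackett-Luce probability of a full ranking (best item first)."""
    prob = 1.0
    remaining = list(ranking)
    while remaining:
        total = sum(math.exp(u[i]) for i in remaining)
        prob *= math.exp(u[remaining[0]]) / total
        remaining.pop(0)
    return prob


def em_mixing_weights(
    rankings: Sequence[Ranking],
    type_utils: Sequence[Dict[str, float]],
    iters: int = 50,
) -> Tuple[List[float], List[List[float]]]:
    """EM over latent types with FIXED per-type models (a simplification)."""
    k = len(type_utils)
    pi = [1.0 / k] * k
    resp: List[List[float]] = []
    for _ in range(iters):
        # E-step: posterior probability of each type for each annotator.
        resp = []
        for r in rankings:
            joint = [pi[j] * ranking_prob(type_utils[j], r) for j in range(k)]
            z = sum(joint)
            resp.append([x / z for x in joint])
        # M-step: mixing weights are the average responsibilities.
        pi = [sum(row[j] for row in resp) / len(rankings) for j in range(k)]
    return pi, resp


# Hypothetical types and made-up data: one ranking per annotator,
# with no demographic labels attached.
TYPE_UTILS = [
    {"a": 2.0, "b": 0.0, "c": -2.0},
    {"a": -2.0, "b": 0.0, "c": 2.0},
]
RANKINGS: List[Ranking] = [("a", "b", "c")] * 6 + [("c", "b", "a")] * 4

weights, responsibilities = em_mixing_weights(RANKINGS, TYPE_UTILS)
print([round(w, 3) for w in weights])
```

On this data the estimated weights land close to the 6-to-4 split in the rankings, and each annotator is assigned almost entirely to one type, which is the kind of label-free grouping the rejected claim filename in the JSON report above refers to.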