extract: 2025-00-00-em-dpo-heterogeneous-preferences
This commit is contained in:
parent bfb2e03271
commit 2916d871e9
6 changed files with 84 additions and 1 deletion

@ -37,6 +37,12 @@ Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICM

- Tulu2-7B: 56.67% win rate across both groups vs 42% minority/70.4% majority for single reward
- 33% improvement for minority groups without majority compromise

### Additional Evidence (confirm)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*

EM-DPO's MinMax Regret Aggregation independently implements egalitarian social choice for ensemble deployment, confirming that min-max fairness is a viable and practically effective approach to pluralistic alignment. MMRA ensures no preference group experiences severe underservice.
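The min-max objective can be sketched numerically. This is a minimal illustration, not the paper's implementation: the regret matrix and random simplex search below are assumptions made for the example. Given each group's regret under each type-specific model, we pick ensemble weights that minimize the worst group's expected regret.

```python
import numpy as np

def minmax_regret_weights(regret, n_samples=100, seed=0):
    """Choose ensemble weights over type-specific models that minimize
    the maximum expected regret across preference groups.

    regret[g, m] = regret of group g when served only by model m.
    Uses random search over the weight simplex; an exact solver
    (e.g. a linear program) could replace it."""
    n_groups, n_models = regret.shape
    rng = np.random.default_rng(seed)
    # Candidate weightings: random simplex points plus the pure policies.
    candidates = np.vstack([rng.dirichlet(np.ones(n_models), size=n_samples),
                            np.eye(n_models)])
    best_w, best_val = None, np.inf
    for w in candidates:
        worst = (regret @ w).max()  # worst-off group's expected regret
        if worst < best_val:
            best_val, best_w = worst, w
    return best_w, best_val

# Toy setting: two groups, two type-specific models. Each pure policy
# serves one group well (0.1 regret) and the other poorly (0.9); the
# egalitarian mixture caps the worst group's regret near 0.5.
regret = np.array([[0.1, 0.9],
                   [0.9, 0.1]])
w, worst = minmax_regret_weights(regret)
```

In this toy case either pure policy leaves one group at 0.9 regret, while a near-balanced mixture holds the worst-off group near 0.5, which is the egalitarian point the min-max criterion selects.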

---

Relevant Notes:

@ -28,6 +28,12 @@ Since [[pluralistic alignment must accommodate irreducibly diverse values simult

MixDPO has not yet been compared to PAL or RLCF in the paper, leaving open whether distributional β outperforms explicit mixture modeling on the same benchmarks. The +11.2 win rate result is from a single preprint on Pythia-2.8B and has not been replicated at larger scales or across multiple evaluators.

### Additional Evidence (extend)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*

EM-DPO provides an alternative mechanism: instead of modeling preference sensitivity as a distribution, use EM to discover discrete latent preference types and train separate models for each. Both approaches avoid demographic labels, but EM-DPO's discrete clustering may be more interpretable than continuous sensitivity distributions.

---

Relevant Notes:

@ -27,6 +27,12 @@ This claim directly addresses the mechanism gap identified in [[RLHF and DPO bot

The paper's proposed solution—RLCHF with explicit social welfare functions—connects to [[collective intelligence requires diversity as a structural precondition not a moral preference]] by formalizing how diverse evaluator input should be preserved rather than collapsed.

### Additional Evidence (extend)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*

EM-DPO provides formal proof that binary comparisons are insufficient for preference identifiability, which means standard pairwise RLHF is not just doing implicit social choice poorly—it's using a comparison structure that cannot mathematically represent preference diversity. Rankings over 3+ responses are necessary.
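The identifiability gap admits a small worked demonstration. This is an illustrative construction, not the paper's proof: with three responses, a population split between two perfectly opposed preference types induces exactly the same pairwise comparison probabilities as a single indifferent type, so binary comparisons alone cannot recover the latent structure.

```python
from itertools import permutations

perms = list(permutations("abc"))  # all 6 rankings of 3 responses

def pairwise_marginals(dist):
    """P(x ranked above y) for each pair, given a distribution over rankings."""
    pairs = [("a", "b"), ("a", "c"), ("b", "c")]
    return {(x, y): sum(p for r, p in dist.items() if r.index(x) < r.index(y))
            for x, y in pairs}

# Population 1: one homogeneous, indifferent type (uniform over rankings).
p1 = {r: 1 / 6 for r in perms}
# Population 2: two sharply opposed latent types -- half the annotators
# rank a > b > c, the other half the exact reverse.
p2 = {("a", "b", "c"): 0.5, ("c", "b", "a"): 0.5}

m1, m2 = pairwise_marginals(p1), pairwise_marginals(p2)
# Every pairwise probability is 1/2 in both populations, so binary
# comparison data cannot distinguish them, even though their
# full-ranking distributions (and latent type structure) differ.
```

Full rankings separate the two populations immediately: population 2 puts all its mass on two rankings, population 1 spreads it over six, which is why rankings over 3+ responses restore identifiability.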

---

Relevant Notes:

@ -27,6 +27,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm

- GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
- Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio

### Additional Evidence (extend)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*

EM-DPO demonstrates that the problem is deeper than single-reward optimization: even with multiple rewards, binary comparisons cannot identify the preference types that would inform how to construct those rewards. The architectural limitation is in the comparison structure, not just the aggregation.

---

Relevant Notes:

@ -0,0 +1,48 @@
{
  "rejected_claims": [
    {
      "filename": "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "em-algorithm-discovers-preference-subpopulations-from-rankings-enabling-ensemble-alignment-without-demographic-labels.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 3,
    "kept": 0,
    "fixed": 11,
    "rejected": 3,
    "fixes_applied": [
      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:set_created:2026-03-15",
      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md",
      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:some disagreements are permanently irreducible because they ",
      "em-algorithm-discovers-preference-subpopulations-from-rankings-enabling-ensemble-alignment-without-demographic-labels.md:set_created:2026-03-15",
      "em-algorithm-discovers-preference-subpopulations-from-rankings-enabling-ensemble-alignment-without-demographic-labels.md:stripped_wiki_link:modeling preference sensitivity as a learned distribution ra",
      "em-algorithm-discovers-preference-subpopulations-from-rankings-enabling-ensemble-alignment-without-demographic-labels.md:stripped_wiki_link:pluralistic alignment must accommodate irreducibly diverse v",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md:set_created:2026-03-15",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md:stripped_wiki_link:maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-b",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md:stripped_wiki_link:post-arrow-social-choice-mechanisms-work-by-weakening-indepe",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md:stripped_wiki_link:pluralistic-ai-alignment-through-multiple-systems-preserves-"
    ],
    "rejections": [
      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:missing_attribution_extractor",
      "em-algorithm-discovers-preference-subpopulations-from-rankings-enabling-ensemble-alignment-without-demographic-labels.md:missing_attribution_extractor",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-15"
}

@ -7,9 +7,13 @@ date: 2025-01-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
status: enrichment
priority: medium
tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness]
processed_by: theseus
processed_date: 2026-03-15
enrichments_applied: ["rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md", "modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

@ -39,3 +43,10 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user prefe
PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values

WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches

EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination

## Key Facts

- EM-DPO paper accepted to EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization)
- EM-DPO requires rankings over 3+ responses for preference identifiability, not binary comparisons
- MinMax Regret Aggregation (MMRA) is the deployment-time ensemble combination method in EM-DPO
- EM-DPO uses expectation-maximization to jointly discover preference types and train type-specific models
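
The E/M alternation in that last fact can be sketched on toy data. This is a stand-in, not EM-DPO itself: each latent type here holds a simple categorical distribution over rankings, where EM-DPO's M-step would instead train a type-specific DPO policy weighted by the E-step responsibilities. All data and names below are illustrative.

```python
import numpy as np

def em_preference_types(rank_ids, n_types, n_rankings, n_iter=50, seed=0):
    """Toy EM over latent preference types, one observed ranking per
    annotator (an integer id in [0, n_rankings)).

    Each type holds a categorical distribution over rankings; in EM-DPO
    the M-step would instead fit a type-specific DPO policy weighted by
    the E-step responsibilities."""
    rng = np.random.default_rng(seed)
    mix = np.full(n_types, 1 / n_types)                  # P(type)
    theta = rng.dirichlet(np.ones(n_rankings), n_types)  # P(ranking | type)
    for _ in range(n_iter):
        # E-step: responsibility of each type for each observation.
        lik = mix[:, None] * theta[:, rank_ids]          # (types, N)
        resp = lik / lik.sum(axis=0, keepdims=True)
        # M-step: re-estimate mixture weights and per-type distributions.
        mix = resp.mean(axis=1)
        for k in range(n_types):
            counts = np.bincount(rank_ids, weights=resp[k],
                                 minlength=n_rankings)
            theta[k] = counts / counts.sum()
    return mix, theta

# Synthetic annotators: two hidden types among the 6 rankings of 3
# responses; one type mostly emits ranking 0 (say a>b>c), the other
# mostly ranking 5 (c>b>a). No demographic labels are used.
rng = np.random.default_rng(1)
z = rng.integers(0, 2, size=400)
rank_ids = np.where(
    z == 0,
    rng.choice(6, size=400, p=[.8, .05, .05, .05, .0, .05]),
    rng.choice(6, size=400, p=[.05, .0, .05, .05, .05, .8]))
mix, theta = em_preference_types(rank_ids, n_types=2, n_rankings=6)
```

On this well-separated toy data the recovered types peak on rankings 0 and 5, mirroring how EM-DPO discovers subpopulations purely from ranking behavior.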