extract: 2025-00-00-em-dpo-heterogeneous-preferences

Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
This commit is contained in:
Teleo Agents 2026-03-15 19:22:23 +00:00
parent 458aa7494e
commit fc5ca162ff
5 changed files with 69 additions and 1 deletion

View file

@@ -25,6 +25,12 @@ Since [[universal alignment is mathematically impossible because Arrows impossib
MaxMin-RLHF provides a constructive implementation of pluralistic alignment through mixture-of-rewards and egalitarian optimization. Rather than collapsing preferences into a single reward, it learns separate reward models for each subpopulation and optimizes for the worst-off group (Sen's egalitarian principle). At Tulu2-7B scale, this achieved a 56.67% win rate for both majority and minority groups, compared to single-reward RLHF's 70.4%/42% split. The mechanism accommodates irreducible diversity by maintaining separate reward functions rather than forcing convergence.
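The max-min objective is easy to state concretely. Below is a minimal, hedged sketch (the toy numbers, discrete response set, and simple mirror-ascent loop are illustrative assumptions, not the paper's setup) of optimizing a policy for the worst-off group's expected reward:

```python
import numpy as np

# Toy setup (illustrative only): 4 candidate responses, 2 preference groups,
# each group's learned reward model reduced to a fixed reward per response.
rewards = np.array([
    [1.0, 0.2, 0.6, 0.1],   # group A's reward model
    [0.1, 0.9, 0.5, 0.8],   # group B's reward model
])

def group_values(policy):
    """Expected reward per group for a stochastic policy over responses."""
    return rewards @ policy

def maxmin_policy(n_steps=2000, lr=0.1):
    """Subgradient mirror ascent on min_g E_policy[reward_g]:
    each step pushes probability mass toward what helps the worst-off group."""
    logits = np.zeros(rewards.shape[1])
    for _ in range(n_steps):
        policy = np.exp(logits) / np.exp(logits).sum()
        worst = int(np.argmin(group_values(policy)))   # Sen-style egalitarian focus
        logits += lr * rewards[worst]
    return np.exp(logits) / np.exp(logits).sum()

pi = maxmin_policy()
print("policy:", pi.round(3), "per-group values:", group_values(pi).round(3))
```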
### Additional Evidence (confirm)
*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
EM-DPO implements this through an ensemble architecture in which each preference type gets a specialized model, combined via egalitarian aggregation at deployment. This demonstrates a concrete mechanism for simultaneous accommodation rather than convergence.
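As a rough illustration of what egalitarian aggregation at deployment could look like, here is a hedged sketch (the `type_models` and `type_rewards` interfaces are assumptions, not EM-DPO's actual API): each specialized model proposes a response, and the response whose worst score across all preference-type reward models is highest gets served.

```python
from typing import Callable, Sequence

def egalitarian_select(prompt: str,
                       type_models: Sequence[Callable[[str], str]],
                       type_rewards: Sequence[Callable[[str, str], float]]) -> str:
    """Each specialized model proposes a candidate; keep the candidate whose
    *minimum* score across all preference-type reward models is highest."""
    candidates = [model(prompt) for model in type_models]

    def worst_case(response: str) -> float:
        return min(reward(prompt, response) for reward in type_rewards)

    return max(candidates, key=worst_case)
```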
---
Relevant Notes:

View file

@@ -27,6 +27,12 @@ This claim directly addresses the mechanism gap identified in [[RLHF and DPO bot
The paper's proposed solution—RLCHF with explicit social welfare functions—connects to [[collective intelligence requires diversity as a structural precondition not a moral preference]] by formalizing how diverse evaluator input should be preserved rather than collapsed.
### Additional Evidence (extend)
*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
EM-DPO makes the social choice function explicit by using MinMax Regret Aggregation grounded in egalitarian fairness principles, demonstrating that pluralistic alignment requires conscious selection of an aggregation criterion rather than implicit averaging through a single reward function.
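The selection rule itself is compact. A minimal, static sketch of min-max regret aggregation over candidate policies (the utility estimates are made up for illustration, and any time weighting is omitted):

```python
import numpy as np

# values[g][k]: estimated utility of candidate policy k for preference group g
# (stand-in numbers; real estimates would come from the learned per-type reward models).
values = np.array([
    [0.9, 0.4, 0.7],   # group A
    [0.3, 0.8, 0.7],   # group B
    [0.5, 0.6, 0.7],   # group C
])

best_per_group = values.max(axis=1, keepdims=True)   # what each group could get at best
regret = best_per_group - values                      # how much each policy shortchanges each group
worst_regret = regret.max(axis=0)                     # judge each policy by its most-aggrieved group
chosen = int(np.argmin(worst_regret))                 # min-max regret choice
print(f"chosen policy index: {chosen}, per-group regret: {regret[:, chosen]}")
```

With these toy numbers the "compromise" policy wins even though it is no group's favorite, which is exactly the behavior an explicit egalitarian criterion is meant to produce.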
---
Relevant Notes:

View file

@@ -27,6 +27,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
- GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
- Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio
### Additional Evidence (extend)
*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
EM-DPO provides a formal proof that binary comparisons are structurally insufficient for preference identification, explaining why single-reward RLHF fails: the pairwise-comparison data structure cannot represent heterogeneous preferences even in principle. Rankings over 3+ responses are mathematically required.
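To make the contrast between the two data structures tangible, here is a hedged illustration using the standard Bradley-Terry (pairwise) and Plackett-Luce (listwise) models; the toy utilities and the specific model forms are assumptions for illustration, not the paper's formal argument:

```python
import numpy as np
from itertools import permutations

def bradley_terry(u_i: float, u_j: float) -> float:
    """P(response i beats response j) under a standard pairwise (Bradley-Terry) model."""
    return np.exp(u_i) / (np.exp(u_i) + np.exp(u_j))

def plackett_luce(utilities: np.ndarray, ranking: tuple) -> float:
    """P(observing a full ranking) under a standard listwise (Plackett-Luce) model."""
    prob, remaining = 1.0, list(ranking)
    while remaining:
        top, *rest = remaining
        prob *= np.exp(utilities[top]) / np.exp(utilities[remaining]).sum()
        remaining = rest
    return prob

# With three responses, a pairwise dataset only ever exposes three marginal win
# probabilities, while a ranking dataset exposes a distribution over all 3! = 6
# orderings -- strictly more structure for separating mixed-in preference types.
u = np.array([1.0, 0.5, 0.0])
print("pairwise:", {(i, j): round(bradley_terry(u[i], u[j]), 3)
                    for i, j in [(0, 1), (0, 2), (1, 2)]})
print("rankings:", {r: round(plackett_luce(u, r), 3) for r in permutations(range(3))})
```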
---
Relevant Notes:

View file

@@ -0,0 +1,40 @@
{
  "rejected_claims": [
    {
      "filename": "binary-preference-comparisons-are-formally-insufficient-for-latent-preference-identification.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "em-algorithm-discovers-latent-preference-types-from-ranking-data-enabling-ensemble-alignment.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-deployment.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 3,
    "kept": 0,
    "fixed": 3,
    "rejected": 3,
    "fixes_applied": [
      "binary-preference-comparisons-are-formally-insufficient-for-latent-preference-identification.md:set_created:2026-03-15",
      "em-algorithm-discovers-latent-preference-types-from-ranking-data-enabling-ensemble-alignment.md:set_created:2026-03-15",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-deployment.md:set_created:2026-03-15"
    ],
    "rejections": [
      "binary-preference-comparisons-are-formally-insufficient-for-latent-preference-identification.md:missing_attribution_extractor",
      "em-algorithm-discovers-latent-preference-types-from-ranking-data-enabling-ensemble-alignment.md:missing_attribution_extractor",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-deployment.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-15"
}

View file

@@ -7,9 +7,13 @@ date: 2025-01-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
status: enrichment
priority: medium
tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness]
processed_by: theseus
processed_date: 2026-03-15
enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -39,3 +43,9 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user prefe
PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches
EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination
## Key Facts
- EM-DPO paper accepted at EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization)
- MMRA (MinMax Regret Aggregation) uses time-weighted regret minimization across discovered preference clusters
- EM algorithm alternates between assigning users to preference types (E-step) and training specialized models (M-step), as sketched below
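A structural sketch of that E-step/M-step alternation, with the actual DPO training of each specialized model stubbed out behind `loglik` and `train` callables (all names here are placeholders for illustration, not EM-DPO's code):

```python
import numpy as np

def em_preference_clustering(user_data, n_types, n_iters, init_model, loglik, train):
    """user_data: list of per-user ranking datasets.
    loglik(model, data) -> log-likelihood of one user's rankings under a model.
    train(weighted_data) -> model refit to responsibility-weighted user data."""
    models = [init_model() for _ in range(n_types)]
    mix = np.full(n_types, 1.0 / n_types)
    for _ in range(n_iters):
        # E-step: posterior probability that each user belongs to each preference type
        log_resp = np.array([[np.log(mix[k]) + loglik(models[k], d) for k in range(n_types)]
                             for d in user_data])
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixture weights and refit each specialized model
        mix = resp.mean(axis=0)
        models = [train([(d, resp[i, k]) for i, d in enumerate(user_data)])
                  for k in range(n_types)]
    return models, mix, resp
```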