extract: 2025-00-00-em-dpo-heterogeneous-preferences
Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
This commit is contained in:
parent a16926c5b5
commit 06d01eb28b
6 changed files with 85 additions and 1 deletion
@@ -37,6 +37,12 @@ Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICM
- Tulu2-7B: 56.67% win rate across both groups vs 42% minority/70.4% majority for single reward
- 33% improvement for minority groups without majority compromise

### Additional Evidence (extend)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
MinMax Regret Aggregation provides an alternative egalitarian mechanism that operates at inference time over an ensemble of models rather than during training through a single reward function.
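A minimal sketch of how such inference-time selection could work, assuming an ensemble of type-specific reward models is available (function names are illustrative, not the paper's API):

```python
def minmax_regret_select(candidates, reward_models):
    """Pick the candidate response whose worst-case regret across
    preference types is smallest (egalitarian min-max regret).

    candidates: list of candidate responses
    reward_models: one callable per latent preference type, mapping a
        response to a scalar reward
    """
    # Best achievable reward under each type: the baseline that
    # regret is measured against.
    best = [max(model(c) for c in candidates) for model in reward_models]

    def worst_case_regret(c):
        return max(b - model(c) for model, b in zip(reward_models, best))

    return min(candidates, key=worst_case_regret)
```

On a toy ensemble where two types each favor a different extreme, the selection falls on the compromise response that neither type strongly regrets.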
---

Relevant Notes:
@@ -25,6 +25,12 @@ Since [[universal alignment is mathematically impossible because Arrows impossib
MaxMin-RLHF provides a constructive implementation of pluralistic alignment through mixture-of-rewards and egalitarian optimization. Rather than converging preferences, it learns separate reward models for each subpopulation and optimizes for the worst-off group (Sen's Egalitarian principle). At Tulu2-7B scale, this achieved 56.67% win rate across both majority and minority groups, compared to single-reward's 70.4%/42% split. The mechanism accommodates irreducible diversity by maintaining separate reward functions rather than forcing convergence.
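The egalitarian aggregation step can be sketched in a few lines, assuming per-group reward models are given (a toy illustration, not the paper's implementation), with a mean aggregate alongside for contrast:

```python
def egalitarian_reward(response, group_reward_models):
    """Sen's egalitarian aggregate: score a response by the reward of
    the worst-off group, so the only way to raise the objective is to
    raise the minimum."""
    return min(model(response) for model in group_reward_models)


def utilitarian_reward(response, group_reward_models):
    """Mean aggregate for contrast: a large majority can outvote a
    harmed minority."""
    rewards = [model(response) for model in group_reward_models]
    return sum(rewards) / len(rewards)
```

A response scoring (0.9, 0.9, 0.1) across three groups beats a (0.6, 0.6, 0.6) compromise under the mean but loses under the egalitarian aggregate, which is the behavior behind the win-rate equalization reported above.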
### Additional Evidence (confirm)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
EM-DPO implements this through type-specific models with egalitarian aggregation, providing a concrete mechanism for maintaining value diversity rather than forcing convergence.
---

Relevant Notes:
@@ -33,6 +33,12 @@ The paper's proposed solution—RLCHF with explicit social welfare functions—c
RLCF makes the social choice mechanism explicit through the bridging algorithm (matrix factorization with intercept scores). Unlike standard RLHF which aggregates preferences opaquely through reward model training, RLCF's use of intercepts as the training signal is a deliberate choice to optimize for cross-partisan agreement—a specific social welfare function.
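A toy sketch of the bridging idea (assumed form: rating ≈ item intercept + user bias + u·v, loosely modeled on Community-Notes-style factorization; the hyperparameters and update rule are illustrative, not RLCF's actual code). The latent factor soaks up polarized votes, so a high item intercept marks content approved across the divide:

```python
import random

def bridging_intercepts(ratings, n_users, n_items, lr=0.05, reg=0.05,
                        epochs=300, seed=0):
    """Fit rating ~ intercept[item] + bias[user] + u[user] * v[item]
    by SGD (one latent dimension). The item intercept is the
    cross-partisan signal: approval that survives once the latent axis
    absorbs polarized votes."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 0.1) for _ in range(n_users)]
    v = [rng.gauss(0, 0.1) for _ in range(n_items)]
    intercept = [0.0] * n_items
    bias = [0.0] * n_users
    for _ in range(epochs):
        for i, j, r in ratings:  # (user, item, rating) triples
            err = r - (intercept[j] + bias[i] + u[i] * v[j])
            intercept[j] += lr * (err - reg * intercept[j])
            bias[i] += lr * (err - reg * bias[i])
            u[i], v[j] = (u[i] + lr * (err * v[j] - reg * u[i]),
                          v[j] + lr * (err * u[i] - reg * v[j]))
    return intercept
```

On synthetic data where one item splits two groups (+1 vs −1) and another gets a uniform 0.7 from everyone, the consensus item's intercept comes out clearly higher, which is exactly the quantity a bridging objective would train on.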
### Additional Evidence (extend)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
EM-DPO demonstrates that the problem is deeper than the choice of aggregation method: the binary comparison format itself is mathematically insufficient for preference type identification, meaning standard RLHF cannot even detect the heterogeneity it would need to aggregate.
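A toy illustration of why pairwise data underdetermines latent types (Bradley-Terry is assumed as the comparison model; the numbers are arbitrary): over a single response pair, any mixture of types is matched exactly by one homogeneous model, so the observed comparisons carry no trace of the mixture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

# Two latent types with opposed reward gaps r(a) - r(b) over a pair (a, b).
gap_type1, gap_type2 = 2.0, -1.0
w = 0.6  # mixture weight on type 1

# Pairwise choice probability observed from the mixed population.
p_mix = w * sigmoid(gap_type1) + (1 - w) * sigmoid(gap_type2)

# One homogeneous Bradley-Terry model with gap logit(p_mix) reproduces
# the observation exactly: the mixture is unidentifiable from this data.
single_gap = logit(p_mix)
assert abs(sigmoid(single_gap) - p_mix) < 1e-12
```

Rankings over three or more responses impose cross-pair consistency constraints that a single homogeneous model generally cannot satisfy, which is consistent with the algorithm's requirement of rankings rather than pairs.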
---

Relevant Notes:
@@ -33,6 +33,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.
### Additional Evidence (confirm)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
EM-DPO provides a formal identifiability proof that pairwise comparisons cannot recover latent preference structure, explaining why single-reward approaches systematically fail at diversity.
---

Relevant Notes:
@@ -0,0 +1,49 @@
{
  "rejected_claims": [
    {
      "filename": "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "em-based-preference-clustering-with-type-specific-models-outperforms-single-reward-alignment-by-discovering-latent-subpopulations.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-when-user-preference-type-is-unknown.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 3,
    "kept": 0,
    "fixed": 12,
    "rejected": 3,
    "fixes_applied": [
      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:set_created:2026-03-16",
      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md",
      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:some disagreements are permanently irreducible because they ",
      "em-based-preference-clustering-with-type-specific-models-outperforms-single-reward-alignment-by-discovering-latent-subpopulations.md:set_created:2026-03-16",
      "em-based-preference-clustering-with-type-specific-models-outperforms-single-reward-alignment-by-discovering-latent-subpopulations.md:stripped_wiki_link:modeling preference sensitivity as a learned distribution ra",
      "em-based-preference-clustering-with-type-specific-models-outperforms-single-reward-alignment-by-discovering-latent-subpopulations.md:stripped_wiki_link:pluralistic alignment must accommodate irreducibly diverse v",
      "em-based-preference-clustering-with-type-specific-models-outperforms-single-reward-alignment-by-discovering-latent-subpopulations.md:stripped_wiki_link:minority-preference-alignment-improves-33-percent-without-ma",
      "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-when-user-preference-type-is-unknown.md:set_created:2026-03-16",
      "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-when-user-preference-type-is-unknown.md:stripped_wiki_link:maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-b",
      "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-when-user-preference-type-is-unknown.md:stripped_wiki_link:post-arrow-social-choice-mechanisms-work-by-weakening-indepe",
      "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-when-user-preference-type-is-unknown.md:stripped_wiki_link:pluralistic-ai-alignment-through-multiple-systems-preserves-"
    ],
    "rejections": [
      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:missing_attribution_extractor",
      "em-based-preference-clustering-with-type-specific-models-outperforms-single-reward-alignment-by-discovering-latent-subpopulations.md:missing_attribution_extractor",
      "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-when-user-preference-type-is-unknown.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-16"
}
@@ -7,9 +7,13 @@ date: 2025-01-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
status: enrichment
priority: medium
tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness]
processed_by: theseus
processed_date: 2026-03-16
enrichments_applied: ["rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -39,3 +43,10 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user prefe
PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values.

WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches.

EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination.
## Key Facts
- EM-DPO presented at EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization)
- The algorithm requires rankings over 3+ responses rather than pairwise comparisons
- MMRA is based on egalitarian social choice theory and min-max regret fairness criterion
- The approach discovers preference types without demographic labels or pre-specified cluster counts
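The facts above can be combined into a compact sketch: a Plackett-Luce likelihood over full rankings (one standard choice for 3+ items; the paper's exact likelihood is an assumption here) and a single EM step that updates type responsibilities and mixture weights. Fitting the per-type models themselves, where EM-DPO would run its DPO updates, is abstracted away as callables.

```python
import math

def plackett_luce(ranking, scores):
    """P(ranking) when each next item is drawn from the remaining ones
    with probability proportional to exp(score)."""
    prob, remaining = 1.0, list(ranking)
    while remaining:
        denom = sum(math.exp(scores[item]) for item in remaining)
        prob *= math.exp(scores[remaining[0]]) / denom
        remaining = remaining[1:]
    return prob

def em_step(rankings, type_likelihoods, weights):
    """One EM iteration over latent preference types.
    E-step: posterior responsibility of each type for each ranking.
    M-step (partial): update the mixture weights from responsibilities;
    refitting the per-type models would happen here as well."""
    resp = []
    for r in rankings:
        joint = [w * lik(r) for w, lik in zip(weights, type_likelihoods)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    n = len(rankings)
    new_weights = [sum(row[k] for row in resp) / n
                   for k in range(len(weights))]
    return resp, new_weights
```

With two opposed score profiles and data skewed toward one of them, a single step already moves the mixture weights toward the type that explains most of the rankings, with no demographic labels involved.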