extract: 2025-00-00-em-dpo-heterogeneous-preferences

Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
This commit is contained in:
Teleo Agents 2026-03-15 19:22:23 +00:00
parent 458aa7494e
commit fc5ca162ff
5 changed files with 69 additions and 1 deletion

View file

@@ -25,6 +25,12 @@ Since [[universal alignment is mathematically impossible because Arrows impossib
MaxMin-RLHF provides a constructive implementation of pluralistic alignment through mixture-of-rewards and egalitarian optimization. Rather than collapsing preferences into a single reward, it learns separate reward models for each subpopulation and optimizes for the worst-off group (Sen's egalitarian principle). At Tulu2-7B scale, this achieved a 56.67% win rate for both majority and minority groups, compared to single-reward RLHF's 70.4%/42% split. The mechanism accommodates irreducible diversity by maintaining separate reward functions rather than forcing convergence.
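The max-min objective is easy to state concretely. Below is a minimal, hedged sketch (the toy numbers, discrete response set, and simple mirror-ascent loop are illustrative assumptions, not the paper's setup) of optimizing a policy for the worst-off group's expected reward:

```python
import numpy as np

# Toy setup (illustrative only): 4 candidate responses, 2 preference groups,
# each group's learned reward model reduced to a fixed reward per response.
rewards = np.array([
    [1.0, 0.2, 0.6, 0.1],   # group A's reward model
    [0.1, 0.9, 0.5, 0.8],   # group B's reward model
])

def group_values(policy):
    """Expected reward per group for a stochastic policy over responses."""
    return rewards @ policy

def maxmin_policy(n_steps=2000, lr=0.1):
    """Subgradient mirror ascent on min_g E_policy[reward_g]:
    each step pushes probability mass toward what helps the worst-off group."""
    logits = np.zeros(rewards.shape[1])
    for _ in range(n_steps):
        policy = np.exp(logits) / np.exp(logits).sum()
        worst = int(np.argmin(group_values(policy)))   # Sen-style egalitarian focus
        logits += lr * rewards[worst]
    return np.exp(logits) / np.exp(logits).sum()

pi = maxmin_policy()
print("policy:", pi.round(3), "per-group values:", group_values(pi).round(3))
```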
### Additional Evidence (confirm)
*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
EM-DPO implements this through an ensemble architecture in which each preference type gets a specialized model, combined via egalitarian aggregation at deployment. This demonstrates a concrete mechanism for simultaneous accommodation rather than convergence.
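As a rough illustration of what egalitarian aggregation at deployment could look like, here is a hedged sketch (the `type_models` and `type_rewards` interfaces are assumptions, not EM-DPO's actual API): each specialized model proposes a response, and the response whose worst score across all preference-type reward models is highest gets served.

```python
from typing import Callable, Sequence

def egalitarian_select(prompt: str,
                       type_models: Sequence[Callable[[str], str]],
                       type_rewards: Sequence[Callable[[str, str], float]]) -> str:
    """Each specialized model proposes a candidate; keep the candidate whose
    *minimum* score across all preference-type reward models is highest."""
    candidates = [model(prompt) for model in type_models]

    def worst_case(response: str) -> float:
        return min(reward(prompt, response) for reward in type_rewards)

    return max(candidates, key=worst_case)
```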
---
Relevant Notes:

View file

@@ -27,6 +27,12 @@ This claim directly addresses the mechanism gap identified in [[RLHF and DPO bot
The paper's proposed solution—RLCHF with explicit social welfare functions—connects to [[collective intelligence requires diversity as a structural precondition not a moral preference]] by formalizing how diverse evaluator input should be preserved rather than collapsed.
### Additional Evidence (extend)
*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
EM-DPO makes the social choice function explicit by using MinMax Regret Aggregation grounded in egalitarian fairness principles, demonstrating that pluralistic alignment requires conscious selection of an aggregation criterion rather than implicit averaging through a single reward function.
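The selection rule itself is compact. A minimal, static sketch of min-max regret aggregation over candidate policies (the utility estimates are made up for illustration, and any time weighting is omitted):

```python
import numpy as np

# values[g][k]: estimated utility of candidate policy k for preference group g
# (stand-in numbers; real estimates would come from the learned per-type reward models).
values = np.array([
    [0.9, 0.4, 0.7],   # group A
    [0.3, 0.8, 0.7],   # group B
    [0.5, 0.6, 0.7],   # group C
])

best_per_group = values.max(axis=1, keepdims=True)   # what each group could get at best
regret = best_per_group - values                      # how much each policy shortchanges each group
worst_regret = regret.max(axis=0)                     # judge each policy by its most-aggrieved group
chosen = int(np.argmin(worst_regret))                 # min-max regret choice
print(f"chosen policy index: {chosen}, per-group regret: {regret[:, chosen]}")
```

With these toy numbers the "compromise" policy wins even though it is no group's favorite, which is exactly the behavior an explicit egalitarian criterion is meant to produce.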
---
Relevant Notes:

View file

@@ -27,6 +27,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
- GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
- Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio
### Additional Evidence (extend)
*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
EM-DPO provides a formal proof that binary comparisons are structurally insufficient for preference identification, explaining why single-reward RLHF fails: the pairwise-comparison data structure cannot represent heterogeneous preferences even in principle. Rankings over 3+ responses are mathematically required.
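To make the contrast between the two data structures tangible, here is a hedged illustration using the standard Bradley-Terry (pairwise) and Plackett-Luce (listwise) models; the toy utilities and the specific model forms are assumptions for illustration, not the paper's formal argument:

```python
import numpy as np
from itertools import permutations

def bradley_terry(u_i: float, u_j: float) -> float:
    """P(response i beats response j) under a standard pairwise (Bradley-Terry) model."""
    return np.exp(u_i) / (np.exp(u_i) + np.exp(u_j))

def plackett_luce(utilities: np.ndarray, ranking: tuple) -> float:
    """P(observing a full ranking) under a standard listwise (Plackett-Luce) model."""
    prob, remaining = 1.0, list(ranking)
    while remaining:
        top, *rest = remaining
        prob *= np.exp(utilities[top]) / np.exp(utilities[remaining]).sum()
        remaining = rest
    return prob

# With three responses, a pairwise dataset only ever exposes three marginal win
# probabilities, while a ranking dataset exposes a distribution over all 3! = 6
# orderings -- strictly more structure for separating mixed-in preference types.
u = np.array([1.0, 0.5, 0.0])
print("pairwise:", {(i, j): round(bradley_terry(u[i], u[j]), 3)
                    for i, j in [(0, 1), (0, 2), (1, 2)]})
print("rankings:", {r: round(plackett_luce(u, r), 3) for r in permutations(range(3))})
```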
---
Relevant Notes:

View file

@@ -0,0 +1,40 @@
{
  "rejected_claims": [
    {
      "filename": "binary-preference-comparisons-are-formally-insufficient-for-latent-preference-identification.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "em-algorithm-discovers-latent-preference-types-from-ranking-data-enabling-ensemble-alignment.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-deployment.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 3,
    "kept": 0,
    "fixed": 3,
    "rejected": 3,
    "fixes_applied": [
      "binary-preference-comparisons-are-formally-insufficient-for-latent-preference-identification.md:set_created:2026-03-15",
      "em-algorithm-discovers-latent-preference-types-from-ranking-data-enabling-ensemble-alignment.md:set_created:2026-03-15",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-deployment.md:set_created:2026-03-15"
    ],
    "rejections": [
      "binary-preference-comparisons-are-formally-insufficient-for-latent-preference-identification.md:missing_attribution_extractor",
      "em-algorithm-discovers-latent-preference-types-from-ranking-data-enabling-ensemble-alignment.md:missing_attribution_extractor",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-deployment.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-15"
}

View file

@@ -7,9 +7,13 @@ date: 2025-01-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
status: enrichment
priority: medium
tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness]
processed_by: theseus
processed_date: 2026-03-15
enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -39,3 +43,9 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user prefe
PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches
EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination
## Key Facts
- EM-DPO paper accepted at EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization)
- MMRA (MinMax Regret Aggregation) uses time-weighted regret minimization across discovered preference clusters
- EM algorithm alternates between assigning users to preference types (E-step) and training specialized models (M-step), as sketched below
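A structural sketch of that E-step/M-step alternation, with the actual DPO training of each specialized model stubbed out behind `loglik` and `train` callables (all names here are placeholders for illustration, not EM-DPO's code):

```python
import numpy as np

def em_preference_clustering(user_data, n_types, n_iters, init_model, loglik, train):
    """user_data: list of per-user ranking datasets.
    loglik(model, data) -> log-likelihood of one user's rankings under a model.
    train(weighted_data) -> model refit to responsibility-weighted user data."""
    models = [init_model() for _ in range(n_types)]
    mix = np.full(n_types, 1.0 / n_types)
    for _ in range(n_iters):
        # E-step: posterior probability that each user belongs to each preference type
        log_resp = np.array([[np.log(mix[k]) + loglik(models[k], d) for k in range(n_types)]
                             for d in user_data])
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixture weights and refit each specialized model
        mix = resp.mean(axis=0)
        models = [train([(d, resp[i, k]) for i, d in enumerate(user_data)])
                  for k in range(n_types)]
    return models, mix, resp
```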