From 2916d871e9acb21ac1b48126fa41c1faab4dfc37 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sun, 15 Mar 2026 18:55:27 +0000
Subject: [PATCH] extract: 2025-00-00-em-dpo-heterogeneous-preferences

Pentagon-Agent: Ganymede
---
 ...inimum-utility-across-preference-groups.md |  6 +++
 ...raphic labels or explicit user modeling.md |  6 +++
 ...ocial-choice-without-normative-scrutiny.md |  6 +++
 ...roportional-to-minority-distinctiveness.md |  6 +++
 ...0-00-em-dpo-heterogeneous-preferences.json | 48 +++++++++++++++++++
 ...-00-00-em-dpo-heterogeneous-preferences.md | 13 ++++-
 6 files changed, 84 insertions(+), 1 deletion(-)
 create mode 100644 inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json

diff --git a/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md b/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md
index 24e8a0e62..0b41c3c3b 100644
--- a/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md
+++ b/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md
@@ -37,6 +37,12 @@ Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICM
 - Tulu2-7B: 56.67% win rate across both groups vs 42% minority/70.4% majority for single reward
 - 33% improvement for minority groups without majority compromise
 
+
+### Additional Evidence (confirm)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
+
+EM-DPO's MinMax Regret Aggregation independently implements egalitarian social choice for ensemble deployment, confirming that min-max fairness is a viable and practically effective approach to pluralistic alignment. MMRA ensures that no preference group is severely underserved.
+
 ---
 Relevant Notes:
diff --git a/domains/ai-alignment/modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling.md b/domains/ai-alignment/modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling.md
index 3308545c3..7b9ee0be9 100644
--- a/domains/ai-alignment/modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling.md
+++ b/domains/ai-alignment/modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling.md
@@ -28,6 +28,12 @@ Since [[pluralistic alignment must accommodate irreducibly diverse values simult
 MixDPO has not yet been compared to PAL or RLCF in the paper, leaving open whether distributional β outperforms explicit mixture modeling on the same benchmarks. The +11.2 win rate result is from a single preprint on Pythia-2.8B and has not been replicated at larger scales or across multiple evaluators.
 
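+A minimal sketch of the distributional-β idea, assuming a log-normal β learned by reparameterization; the function and parameter names here are illustrative assumptions, not MixDPO's published API:
+
+```python
+import torch
+import torch.nn.functional as F
+
+def distributional_beta_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
+                                 beta_mu, beta_log_sigma, n_samples=8):
+    """DPO loss with beta drawn from a learned distribution (sketch)."""
+    # Implicit reward margin between chosen and rejected responses.
+    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)        # (batch,)
+    # Reparameterized draws beta ~ LogNormal(mu, sigma), so gradients
+    # reach the distribution parameters as well as the policy.
+    eps = torch.randn(n_samples, 1)
+    beta = torch.exp(beta_mu + eps * beta_log_sigma.exp())        # (n_samples, 1)
+    # Standard DPO objective, averaged over the beta distribution
+    # instead of evaluated at one fixed scalar.
+    return -F.logsigmoid(beta * margin.unsqueeze(0)).mean()
+```
+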
+
+### Additional Evidence (extend)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
+
+EM-DPO offers an alternative mechanism: instead of modeling preference sensitivity as a distribution, it uses EM to discover discrete latent preference types and trains a separate model for each. Both approaches avoid demographic labels, but EM-DPO's discrete clustering may be more interpretable than continuous sensitivity distributions.
+
 ---
 Relevant Notes:
diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
index d8d679b81..32a4bda73 100644
--- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@@ -27,6 +27,12 @@ This claim directly addresses the mechanism gap identified in [[RLHF and DPO bot
 The paper's proposed solution, RLCHF with explicit social welfare functions, connects to [[collective intelligence requires diversity as a structural precondition not a moral preference]] by formalizing how diverse evaluator input should be preserved rather than collapsed.
 
+
+### Additional Evidence (extend)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
+
+EM-DPO formally proves that binary comparisons are insufficient for preference identifiability. Standard pairwise RLHF is therefore not just doing implicit social choice poorly; it rests on a comparison structure that cannot mathematically represent preference diversity. Rankings over 3+ responses are necessary.
+
 ---
 Relevant Notes:
diff --git a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
index b587b34f8..4e0e3ee0b 100644
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@@ -27,6 +27,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
 - GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
 - Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio
 
+
+### Additional Evidence (extend)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
+
+EM-DPO demonstrates that the problem is deeper than single-reward optimization: even with multiple rewards, binary comparisons cannot identify the preference types that would inform how to construct those rewards. The architectural limitation lies in the comparison structure, not just the aggregation.
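+
+A minimal sketch of why ranking data carries more identifying signal, using the standard Plackett-Luce ranking likelihood; this illustrates the general point rather than reproducing EM-DPO's proof:
+
+```python
+import numpy as np
+
+def plackett_luce_loglik(utilities_in_ranked_order):
+    """Log-likelihood of one ranking under Plackett-Luce: a chain of
+    softmax choices, each over the responses not yet placed."""
+    u = np.asarray(utilities_in_ranked_order, dtype=float)
+    return sum(u[i] - np.log(np.exp(u[i:]).sum()) for i in range(len(u) - 1))
+
+# With two responses the model reduces to a Bradley-Terry pairwise
+# probability; the extra chained terms only appear for rankings over
+# 3+ responses, and it is those terms that can separate a mixture of
+# latent preference types that pairwise marginals alone can hide.
+print(plackett_luce_loglik([2.0, 1.0]))        # pairwise case
+print(plackett_luce_loglik([2.0, 1.0, -1.0]))  # length-3 ranking
+```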
+
 ---
 Relevant Notes:
diff --git a/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json b/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json
new file mode 100644
index 000000000..5f5b4cf4e
--- /dev/null
+++ b/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json
@@ -0,0 +1,48 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "em-algorithm-discovers-preference-subpopulations-from-rankings-enabling-ensemble-alignment-without-demographic-labels.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 3,
+    "kept": 0,
+    "fixed": 11,
+    "rejected": 3,
+    "fixes_applied": [
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:set_created:2026-03-15",
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md",
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:some disagreements are permanently irreducible because they ",
+      "em-algorithm-discovers-preference-subpopulations-from-rankings-enabling-ensemble-alignment-without-demographic-labels.md:set_created:2026-03-15",
+      "em-algorithm-discovers-preference-subpopulations-from-rankings-enabling-ensemble-alignment-without-demographic-labels.md:stripped_wiki_link:modeling preference sensitivity as a learned distribution ra",
+      "em-algorithm-discovers-preference-subpopulations-from-rankings-enabling-ensemble-alignment-without-demographic-labels.md:stripped_wiki_link:pluralistic alignment must accommodate irreducibly diverse v",
+      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md:set_created:2026-03-15",
+      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md:stripped_wiki_link:maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-b",
+      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md:stripped_wiki_link:post-arrow-social-choice-mechanisms-work-by-weakening-indepe",
+      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md:stripped_wiki_link:pluralistic-ai-alignment-through-multiple-systems-preserves-"
+    ],
+    "rejections": [
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:missing_attribution_extractor",
+      "em-algorithm-discovers-preference-subpopulations-from-rankings-enabling-ensemble-alignment-without-demographic-labels.md:missing_attribution_extractor",
"minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-ensemble-deployment.md:missing_attribution_extractor" + ] + }, + "model": "anthropic/claude-sonnet-4.5", + "date": "2026-03-15" +} \ No newline at end of file diff --git a/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md b/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md index 52de537f5..9ed8cf6b3 100644 --- a/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md +++ b/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md @@ -7,9 +7,13 @@ date: 2025-01-01 domain: ai-alignment secondary_domains: [] format: paper -status: unprocessed +status: enrichment priority: medium tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness] +processed_by: theseus +processed_date: 2026-03-15 +enrichments_applied: ["rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md", "modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling.md"] +extraction_model: "anthropic/claude-sonnet-4.5" --- ## Content @@ -39,3 +43,10 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user prefe PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination + + +## Key Facts +- EM-DPO paper accepted to EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization) +- EM-DPO requires rankings over 3+ responses for preference identifiability, not binary comparisons +- MinMax Regret Aggregation (MMRA) is the deployment-time ensemble combination method in EM-DPO +- EM-DPO uses expectation-maximization to jointly discover preference types and train type-specific models