From 74975eb326f3a50ac42e467209d4aa855be2315d Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Mon, 16 Mar 2026 14:47:39 +0000
Subject: [PATCH] extract: 2025-00-00-em-dpo-heterogeneous-preferences

Pentagon-Agent: Ganymede
---
 ...inimum-utility-across-preference-groups.md | 6 +++
 ...an converging on a single aligned state.md | 6 +++
 ...ocial-choice-without-normative-scrutiny.md | 6 +++
 ...roportional-to-minority-distinctiveness.md | 6 +++
 ...0-00-em-dpo-heterogeneous-preferences.json | 48 +++++++++++++++++++
 ...-00-00-em-dpo-heterogeneous-preferences.md | 13 ++++-
 6 files changed, 84 insertions(+), 1 deletion(-)
 create mode 100644 inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json

diff --git a/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md b/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md
index 24e8a0e6..56fbce1e 100644
--- a/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md
+++ b/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md
@@ -37,6 +37,12 @@ Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICM
 - Tulu2-7B: 56.67% win rate across both groups vs 42% minority/70.4% majority for single reward
 - 33% improvement for minority groups without majority compromise
 
+
+### Additional Evidence (extend)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
+
+MMRA (MinMax Regret Aggregation) extends MaxMin-RLHF to the deployment phase by minimizing maximum regret across preference groups when the user's type is unknown at inference time, showing how egalitarian principles can govern both training and inference in pluralistic systems.
+
 ---
 
 Relevant Notes:

diff --git a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
index f97a0c88..1436e6d9 100644
--- a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
+++ b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
@@ -25,6 +25,12 @@ Since [[universal alignment is mathematically impossible because Arrows impossib
 
 MaxMin-RLHF provides a constructive implementation of pluralistic alignment through mixture-of-rewards and egalitarian optimization. Rather than converging preferences, it learns separate reward models for each subpopulation and optimizes for the worst-off group (Sen's Egalitarian principle). At Tulu2-7B scale, this achieved 56.67% win rate across both majority and minority groups, compared to single-reward's 70.4%/42% split. The mechanism accommodates irreducible diversity by maintaining separate reward functions rather than forcing convergence.
 
+
+### Additional Evidence (confirm)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
+
+EM-DPO implements this through an ensemble architecture: it discovers K latent preference types, trains K specialized models, and deploys them simultaneously with egalitarian aggregation. This demonstrates that pluralistic alignment is technically feasible without requiring demographic labels or manual preference specification.
+
 ---
 
 Relevant Notes:

diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
index 6ae355b1..b2cd0239 100644
--- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@@ -39,6 +39,12 @@ RLCF makes the social choice mechanism explicit through the bridging algorithm (
 
 Comprehensive February 2026 survey by An & Du documents that contemporary ML systems implement social choice mechanisms implicitly across RLHF, participatory budgeting, and liquid democracy applications, with 18 identified open problems spanning incentive guarantees and pluralistic preference aggregation.
 
+
+### Additional Evidence (extend)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
+
+EM-DPO makes the social choice function explicit by using MinMax Regret Aggregation, grounded in egalitarian fairness principles, demonstrating that pluralistic alignment requires choosing a specific social welfare function (here: minimax regret) rather than pretending aggregation is value-neutral.
+
 ---
 
 Relevant Notes:

diff --git a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
index ddeaf7b8..e8d6edba 100644
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@@ -39,6 +39,12 @@ Study demonstrates that models trained on different demographic populations show
 
 An & Du's survey reveals the mechanism behind single-reward failure: RLHF is doing social choice (preference aggregation) but treating it as an engineering detail rather than a normative design choice, which means the aggregation function is chosen implicitly and without examination of which fairness criteria it satisfies.
 
+
+### Additional Evidence (extend)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
+
+EM-DPO provides a formal proof that binary comparisons are mathematically insufficient for preference type identification, explaining why single-reward RLHF fails: the training signal format cannot contain the information needed to discover heterogeneity, regardless of dataset size. Rankings over 3+ responses are necessary.
+
 ---
 
 Relevant Notes:

diff --git a/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json b/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json
new file mode 100644
index 00000000..d80fd366
--- /dev/null
+++ b/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json
@@ -0,0 +1,48 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "em-algorithm-preference-clustering-discovers-latent-user-types-without-demographic-labels-enabling-unsupervised-pluralistic-alignment.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-fairness-to-ensemble-deployment.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 3,
+    "kept": 0,
+    "fixed": 11,
+    "rejected": 3,
+    "fixes_applied": [
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:set_created:2026-03-16",
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md",
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:pluralistic alignment must accommodate irreducibly diverse v",
+      "em-algorithm-preference-clustering-discovers-latent-user-types-without-demographic-labels-enabling-unsupervised-pluralistic-alignment.md:set_created:2026-03-16",
+      "em-algorithm-preference-clustering-discovers-latent-user-types-without-demographic-labels-enabling-unsupervised-pluralistic-alignment.md:stripped_wiki_link:modeling preference sensitivity as a learned distribution ra",
+      "em-algorithm-preference-clustering-discovers-latent-user-types-without-demographic-labels-enabling-unsupervised-pluralistic-alignment.md:stripped_wiki_link:pluralistic alignment must accommodate irreducibly diverse v",
+      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-fairness-to-ensemble-deployment.md:set_created:2026-03-16",
+      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-fairness-to-ensemble-deployment.md:stripped_wiki_link:maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-b",
+      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-fairness-to-ensemble-deployment.md:stripped_wiki_link:post-arrow-social-choice-mechanisms-work-by-weakening-indepe",
+      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-fairness-to-ensemble-deployment.md:stripped_wiki_link:minority-preference-alignment-improves-33-percent-without-ma"
+    ],
+    "rejections": [
+      "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:missing_attribution_extractor",
+      "em-algorithm-preference-clustering-discovers-latent-user-types-without-demographic-labels-enabling-unsupervised-pluralistic-alignment.md:missing_attribution_extractor",
+      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-fairness-to-ensemble-deployment.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-16"
+}
\ No newline at end of file

diff --git a/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md b/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md
index 52de537f..246dec8a 100644
--- a/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md
+++ b/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md
@@ -7,9 +7,13 @@ date: 2025-01-01
 domain: ai-alignment
 secondary_domains: []
 format: paper
-status: unprocessed
+status: enrichment
 priority: medium
 tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness]
+processed_by: theseus
+processed_date: 2026-03-16
+enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md", "maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -39,3 +43,10 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user prefe
 PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
 WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches
 EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination
+
+
+## Key Facts
+- EM-DPO presented at EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization)
+- EM-DPO uses rankings over 3+ responses rather than binary comparisons for preference data
+- MinMax Regret Aggregation is based on egalitarian social choice theory
+- The paper focuses on fairness rather than efficiency, distinguishing it from PAL's approach