From fc5ca162ffefeb82ca494b8d57f52fac254a5479 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sun, 15 Mar 2026 19:22:23 +0000
Subject: [PATCH] extract: 2025-00-00-em-dpo-heterogeneous-preferences

Pentagon-Agent: Ganymede
---
 ...an converging on a single aligned state.md |  6 +++
 ...ocial-choice-without-normative-scrutiny.md |  6 +++
 ...roportional-to-minority-distinctiveness.md |  6 +++
 ...0-00-em-dpo-heterogeneous-preferences.json | 40 +++++++++++++++++++
 ...-00-00-em-dpo-heterogeneous-preferences.md | 12 +++++-
 5 files changed, 69 insertions(+), 1 deletion(-)
 create mode 100644 inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json

diff --git a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
index f97a0c886..4dade9ff1 100644
--- a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
+++ b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
@@ -25,6 +25,12 @@ Since [[universal alignment is mathematically impossible because Arrows impossib
 
 MaxMin-RLHF provides a constructive implementation of pluralistic alignment through mixture-of-rewards and egalitarian optimization. Rather than converging preferences, it learns separate reward models for each subpopulation and optimizes for the worst-off group (Sen's Egalitarian principle). At Tulu2-7B scale, this achieved 56.67% win rate across both majority and minority groups, compared to single-reward's 70.4%/42% split. The mechanism accommodates irreducible diversity by maintaining separate reward functions rather than forcing convergence.
 
+
+### Additional Evidence (confirm)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
+
+EM-DPO implements this through ensemble architecture where each preference type gets a specialized model, combined via egalitarian aggregation at deployment. Demonstrates concrete mechanism for simultaneous accommodation rather than convergence.
+
 ---
 
 Relevant Notes:
diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
index d8d679b81..cecf162cb 100644
--- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@@ -27,6 +27,12 @@ This claim directly addresses the mechanism gap identified in [[RLHF and DPO bot
 
 The paper's proposed solution—RLCHF with explicit social welfare functions—connects to [[collective intelligence requires diversity as a structural precondition not a moral preference]] by formalizing how diverse evaluator input should be preserved rather than collapsed.
 
+
+### Additional Evidence (extend)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
+
+EM-DPO makes the social choice function explicit by using MinMax Regret Aggregation based on egalitarian fairness principles, demonstrating that pluralistic alignment requires conscious selection of aggregation criteria rather than implicit averaging through single reward functions.
+
 ---
 
 Relevant Notes:
diff --git a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
index b587b34f8..3ec825433 100644
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@@ -27,6 +27,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
 - GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
 - Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio
 
+
+### Additional Evidence (extend)
+*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-15*
+
+EM-DPO provides formal proof that binary comparisons are structurally insufficient for preference identification, explaining WHY single-reward RLHF fails: the pairwise comparison data structure cannot represent heterogeneous preferences even in principle. Rankings over 3+ responses are mathematically required.
+
 ---
 
 Relevant Notes:
diff --git a/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json b/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json
new file mode 100644
index 000000000..ed9ca33c0
--- /dev/null
+++ b/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json
@@ -0,0 +1,40 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "binary-preference-comparisons-are-formally-insufficient-for-latent-preference-identification.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "em-algorithm-discovers-latent-preference-types-from-ranking-data-enabling-ensemble-alignment.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-deployment.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 3,
+    "kept": 0,
+    "fixed": 3,
+    "rejected": 3,
+    "fixes_applied": [
+      "binary-preference-comparisons-are-formally-insufficient-for-latent-preference-identification.md:set_created:2026-03-15",
+      "em-algorithm-discovers-latent-preference-types-from-ranking-data-enabling-ensemble-alignment.md:set_created:2026-03-15",
+      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-deployment.md:set_created:2026-03-15"
+    ],
+    "rejections": [
+      "binary-preference-comparisons-are-formally-insufficient-for-latent-preference-identification.md:missing_attribution_extractor",
+      "em-algorithm-discovers-latent-preference-types-from-ranking-data-enabling-ensemble-alignment.md:missing_attribution_extractor",
+      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-during-deployment.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-15"
+}
\ No newline at end of file
diff --git a/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md b/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md
index 52de537f5..ad392516a 100644
--- a/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md
+++ b/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md
@@ -7,9 +7,13 @@ date: 2025-01-01
 domain: ai-alignment
 secondary_domains: []
 format: paper
-status: unprocessed
+status: enrichment
 priority: medium
 tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness]
+processed_by: theseus
+processed_date: 2026-03-15
+enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -39,3 +43,9 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user prefe
 PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
 WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches
 EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination
+
+
+## Key Facts
+- EM-DPO paper accepted at EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization)
+- MMRA aggregation uses time-weighted regret minimization across discovered preference clusters
+- EM algorithm alternates between assigning users to preference types (E-step) and training specialized models (M-step)
-- 
2.45.2