diff --git a/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md b/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md index 24e8a0e62..dca13fc5c 100644 --- a/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md +++ b/domains/ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md @@ -37,6 +37,12 @@ Chakraborty et al., "MaxMin-RLHF: Alignment with Diverse Human Preferences," ICM - Tulu2-7B: 56.67% win rate across both groups vs 42% minority/70.4% majority for single reward - 33% improvement for minority groups without majority compromise + +### Additional Evidence (extend) +*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16* + +MinMax Regret Aggregation provides an alternative egalitarian mechanism that minimizes worst-case regret rather than maximizing minimum utility, offering a different fairness guarantee within the same social choice family + --- Relevant Notes: diff --git a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md index f97a0c886..9822fe644 100644 --- a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md +++ b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md @@ -21,10 +21,16 @@ Since [[universal alignment is mathematically impossible because 
Arrows impossib ### Additional Evidence (extend) -*Source: [[2024-02-00-chakraborty-maxmin-rlhf]] | Added: 2026-03-15 | Extractor: anthropic/claude-sonnet-4.5* +*Source: 2024-02-00-chakraborty-maxmin-rlhf | Added: 2026-03-15 | Extractor: anthropic/claude-sonnet-4.5* MaxMin-RLHF provides a constructive implementation of pluralistic alignment through mixture-of-rewards and egalitarian optimization. Rather than converging preferences, it learns separate reward models for each subpopulation and optimizes for the worst-off group (Sen's Egalitarian principle). At Tulu2-7B scale, this achieved 56.67% win rate across both majority and minority groups, compared to single-reward's 70.4%/42% split. The mechanism accommodates irreducible diversity by maintaining separate reward functions rather than forcing convergence. + +### Additional Evidence (confirm) +*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16* + +EM-DPO demonstrates a concrete implementation: discover latent preference types via EM, train specialized models for each, deploy via egalitarian aggregation that serves all types simultaneously + --- Relevant Notes: diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md index dc59e9565..adce0d459 100644 --- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md +++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md @@ -29,10 +29,16 @@ The paper's proposed solution—RLCHF with explicit social welfare functions—c ### Additional Evidence (extend) -*Source: [[2025-06-00-li-scaling-human-judgment-community-notes-llms]] | Added: 2026-03-15* +*Source: 2025-06-00-li-scaling-human-judgment-community-notes-llms | Added: 2026-03-15* RLCF makes the social choice mechanism explicit through the bridging algorithm (matrix factorization with intercept scores). 
Unlike standard RLHF which aggregates preferences opaquely through reward model training, RLCF's use of intercepts as the training signal is a deliberate choice to optimize for cross-partisan agreement—a specific social welfare function. + +### Additional Evidence (extend) +*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16* + +EM-DPO provides formal proof that binary comparisons (standard RLHF input) are mathematically insufficient for identifying preference heterogeneity, making the implicit social choice not just unscrutinized but structurally blind to diversity + --- Relevant Notes: diff --git a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md index a19a82ade..68b24f45e 100644 --- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md +++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md @@ -29,10 +29,16 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm ### Additional Evidence (confirm) -*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-15* +*Source: 2025-11-00-operationalizing-pluralistic-values-llm-alignment | Added: 2026-03-15* Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others. 
+ +### Additional Evidence (extend) +*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16* + +The alignment gap has a formal information-theoretic cause: binary comparisons lack the structure to identify latent types, so single-reward models cannot even detect the diversity they fail to serve + --- Relevant Notes: diff --git a/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json b/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json new file mode 100644 index 000000000..c571d8461 --- /dev/null +++ b/inbox/archive/.extraction-debug/2025-00-00-em-dpo-heterogeneous-preferences.json @@ -0,0 +1,48 @@ +{ + "rejected_claims": [ + { + "filename": "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md", + "issues": [ + "missing_attribution_extractor" + ] + }, + { + "filename": "em-algorithm-can-simultaneously-discover-preference-types-and-train-type-specific-models-without-demographic-labels.md", + "issues": [ + "missing_attribution_extractor" + ] + }, + { + "filename": "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-by-minimizing-worst-case-preference-group-harm.md", + "issues": [ + "missing_attribution_extractor" + ] + } + ], + "validation_stats": { + "total": 3, + "kept": 0, + "fixed": 11, + "rejected": 3, + "fixes_applied": [ + "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:set_created:2026-03-16", + "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md", + 
"binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-", + "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:stripped_wiki_link:some disagreements are permanently irreducible because they ", + "em-algorithm-can-simultaneously-discover-preference-types-and-train-type-specific-models-without-demographic-labels.md:set_created:2026-03-16", + "em-algorithm-can-simultaneously-discover-preference-types-and-train-type-specific-models-without-demographic-labels.md:stripped_wiki_link:modeling preference sensitivity as a learned distribution ra", + "em-algorithm-can-simultaneously-discover-preference-types-and-train-type-specific-models-without-demographic-labels.md:stripped_wiki_link:pluralistic alignment must accommodate irreducibly diverse v", + "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-by-minimizing-worst-case-preference-group-harm.md:set_created:2026-03-16", + "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-by-minimizing-worst-case-preference-group-harm.md:stripped_wiki_link:maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-b", + "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-by-minimizing-worst-case-preference-group-harm.md:stripped_wiki_link:post-arrow-social-choice-mechanisms-work-by-weakening-indepe", + "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-by-minimizing-worst-case-preference-group-harm.md:stripped_wiki_link:pluralistic-ai-alignment-through-multiple-systems-preserves-" + ], + "rejections": [ + "binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-rlhf-structurally-blind-to-diversity.md:missing_attribution_extractor", + 
"em-algorithm-can-simultaneously-discover-preference-types-and-train-type-specific-models-without-demographic-labels.md:missing_attribution_extractor", + "minmax-regret-aggregation-implements-egalitarian-fairness-for-pluralistic-deployment-by-minimizing-worst-case-preference-group-harm.md:missing_attribution_extractor" + ] + }, + "model": "anthropic/claude-sonnet-4.5", + "date": "2026-03-16" +} \ No newline at end of file diff --git a/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md b/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md index 52de537f5..63f2355ef 100644 --- a/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md +++ b/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md @@ -7,9 +7,13 @@ date: 2025-01-01 domain: ai-alignment secondary_domains: [] format: paper -status: unprocessed +status: enrichment priority: medium tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness] +processed_by: theseus +processed_date: 2026-03-16 +enrichments_applied: ["rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md"] +extraction_model: "anthropic/claude-sonnet-4.5" --- ## Content @@ -31,7 +35,7 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user prefe **Why this matters:** Combines mechanism design (egalitarian social choice) with ML (EM clustering). The insight about binary comparisons being insufficient is technically important — it explains why standard RLHF/DPO with pairwise comparisons systematically fails at diversity. **What surprised me:** The binary-vs-ranking distinction. 
If binary comparisons can't identify latent preferences, then ALL existing pairwise RLHF/DPO deployments are structurally blind to preference diversity. This is a fundamental limitation, not just a practical one. **What I expected but didn't find:** No head-to-head comparison with PAL or MixDPO. No deployment results beyond benchmarks. -**KB connections:** Addresses [[RLHF and DPO both fail at preference diversity]] with a specific mechanism. The egalitarian aggregation connects to [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps]]. +**KB connections:** Addresses RLHF and DPO both fail at preference diversity with a specific mechanism. The egalitarian aggregation connects to some disagreements are permanently irreducible because they stem from genuine value differences not information gaps. **Extraction hints:** Extract claims about: (1) binary comparisons being formally insufficient for preference identification, (2) EM-based preference type discovery, (3) egalitarian aggregation as pluralistic deployment strategy. **Context:** EAAMO 2025 — Equity and Access in Algorithms, Mechanisms, and Optimization. The fairness focus distinguishes this from PAL's efficiency focus. 
@@ -39,3 +43,9 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user prefe PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination + + +## Key Facts +- EM-DPO paper presented at EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization) +- EM-DPO requires rankings over 3+ responses rather than pairwise comparisons for preference type identifiability +- MMRA aggregation uses a three-day time-weighted average price window (note: this appears to be copied from MetaDAO context and may be an error in the source notes)
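Since the notes touched by this diff contrast utilitarian averaging, MaxMin (Sen's egalitarian rule), and MinMax Regret Aggregation, a toy sketch may help compare the three rules side by side. All policy names, group names, and utility numbers below are hypothetical illustrations chosen so the rules disagree; they are not results from the EM-DPO or MaxMin-RLHF papers.

```python
# Toy comparison of three social-choice aggregation rules over
# per-group utilities. Utilities are hypothetical, not paper results.

def utilitarian_choice(utilities):
    """Maximize mean utility across groups (majoritarian tendency)."""
    return max(utilities,
               key=lambda p: sum(utilities[p].values()) / len(utilities[p]))

def maxmin_choice(utilities):
    """Sen's egalitarian rule: maximize the worst-off group's utility."""
    return max(utilities, key=lambda p: min(utilities[p].values()))

def minmax_regret_choice(utilities):
    """MMRA-style rule: minimize the worst group's regret relative to
    that group's best achievable utility across candidate policies."""
    groups = next(iter(utilities.values()))
    best = {g: max(utilities[p][g] for p in utilities) for g in groups}
    return min(utilities,
               key=lambda p: max(best[g] - utilities[p][g] for g in best))

# Hypothetical per-policy utilities for a majority and a minority group.
utilities = {
    "single_reward":  {"majority": 0.80, "minority": 0.40},
    "mixture_maxmin": {"majority": 0.57, "minority": 0.57},
}

print(utilitarian_choice(utilities))    # "single_reward"  (mean 0.60 vs 0.57)
print(maxmin_choice(utilities))         # "mixture_maxmin" (min 0.57 vs 0.40)
print(minmax_regret_choice(utilities))  # "single_reward"  (regret 0.17 vs 0.23)
```

Note that the maxmin and minmax-regret rules pick different policies on the same inputs, which is the sense in which MMRA offers "a different fairness guarantee within the same social choice family" rather than a restatement of MaxMin-RLHF.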