From 2f86a53bc8b28383c903273fa8d7173bb21063b9 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Wed, 11 Mar 2026 18:21:47 +0000
Subject: [PATCH] auto-fix: address review feedback on 2025-00-00-em-dpo-heterogeneous-preferences.md

- Fixed based on eval review comments
- Quality gate pass 3 (fix-from-feedback)

Pentagon-Agent: Theseus
---
 ...-making-pairwise-RLHF-structurally-blind-to-diversity.md | 6 +++---
 ...ce-group-dissatisfaction-in-pluralistic-AI-deployment.md | 6 +++---
 ...usly-rather-than-converging-on-a-single-aligned-state.md | 6 +++---
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md b/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md
index 9f729eb2d..d98b1b963 100644
--- a/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md
+++ b/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md
@@ -1,14 +1,14 @@
 ---
 type: claim
-title: Standard Pairwise RLHF Collapses Latent Preference Types Because Single-Reward-Function Training Cannot Recover Diversity That Binary Comparisons Encode
-description: Binary preference comparisons contain information about preference diversity, but standard RLHF and DPO methods using single reward models structurally collapse this information, making them incapable of detecting or preserving preference heterogeneity
+title: Binary preference comparisons cannot identify latent preference types making pairwise RLHF structurally blind to diversity
+description: Standard RLHF and DPO methods using single reward models structurally collapse preference diversity information that binary comparisons contain, making them incapable of detecting or preserving preference heterogeneity
 confidence: experimental
 created: 2026-03-11
 processed_date: 2026-03-11
 source: "EM-DPO Heterogeneous Preferences Extraction (2025-00-00-em-dpo-heterogeneous-preferences-extraction)"
 ---
 
-# Standard Pairwise RLHF Collapses Latent Preference Types Because Single-Reward-Function Training Cannot Recover Diversity That Binary Comparisons Encode
+# Binary preference comparisons cannot identify latent preference types making pairwise RLHF structurally blind to diversity
 
 Standard RLHF and DPO methods train on binary preference comparisons (response A > response B), but their single-reward-function architecture prevents them from identifying or distinguishing between latent preference types. The EM-DPO paper demonstrates through formal identifiability analysis that binary ranking data contains sufficient information to recover preference diversity, but standard training procedures structurally collapse it.
 
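To see the collapse this note describes concretely, consider a toy simulation (an illustration under assumed numbers, not the paper's code): two latent types disagree on every pair, so pooled per-pair win rates sit near 0.5 and a single Bradley-Terry reward fits a zero margin everywhere, yet an EM fit over each annotator's full label vector recovers both types, because labels are correlated within an annotator.

```python
# Toy illustration: pooled pairwise data hides two latent preference
# types that per-annotator EM can recover. All numbers are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_annotators, n_pairs, K = 400, 6, 2

# Type 0 prefers response A on every pair; type 1 prefers response B.
p = np.vstack([np.full(n_pairs, 0.9), np.full(n_pairs, 0.1)])
types = rng.integers(0, K, n_annotators)                  # hidden type labels
labels = (rng.random((n_annotators, n_pairs)) < p[types]).astype(float)

# What a single reward model sees: each pair looks like a fair coin,
# so the fitted Bradley-Terry reward gap is ~0 and all diversity is lost.
print("pooled P(A wins):", labels.mean(axis=0).round(2))

# EM for a K-component Bernoulli mixture over annotator label vectors.
theta = rng.uniform(0.3, 0.7, (K, n_pairs))               # per-type win probs
pi = np.full(K, 1.0 / K)
for _ in range(200):
    # E-step: posterior responsibility of each type for each annotator
    loglik = labels @ np.log(theta).T + (1 - labels) @ np.log(1 - theta).T
    logp = loglik + np.log(pi)
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture weights and per-type win probabilities
    pi = post.mean(axis=0)
    theta = np.clip((post.T @ labels) / post.sum(axis=0)[:, None],
                    1e-6, 1 - 1e-6)

print("recovered per-type P(A wins):", theta.round(2))  # ~0.9 and ~0.1 rows,
                                                        # in either order
```

The pooled print shows why a single reward function is blind here; the recovered `theta` rows show that the same binary data identified both types once annotator structure was modeled.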
diff --git a/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment.md b/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment.md
index d3657b1a3..8232cf238 100644
--- a/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment.md
+++ b/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment.md
@@ -1,6 +1,6 @@
 ---
 type: claim
-title: Egalitarian Aggregation Through Minmax Regret Bounds Worst-Case Preference Group Dissatisfaction in Pluralistic AI Deployment
+title: Egalitarian aggregation through minmax regret bounds worst case preference group dissatisfaction in pluralistic AI deployment
 description: MinMax Regret aggregation provides an egalitarian mechanism for combining diverse preference groups by minimizing the maximum dissatisfaction any group experiences, operationalizing fairness through social choice theory
 confidence: experimental
 created: 2026-03-11
@@ -9,7 +9,7 @@ source: "EM-DPO Heterogeneous Preferences Extraction (2025-00-00-em-dpo-heteroge
 enrichments: ["2025-00-00-em-dpo-heterogeneous-preferences-extraction"]
 ---
 
-# Egalitarian Aggregation Through Minmax Regret Bounds Worst-Case Preference Group Dissatisfaction in Pluralistic AI Deployment
+# Egalitarian aggregation through minmax regret bounds worst case preference group dissatisfaction in pluralistic AI deployment
 
 MinMax Regret aggregation provides a formal mechanism for combining outputs from multiple preference-aligned models while guaranteeing fairness across groups. The EM-DPO paper implements this as the deployment-time aggregation strategy after training K separate models on discovered preference types.
 
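As a sketch of what that aggregation step can look like operationally (the utility matrix `U` and the randomized-mixture framing are assumptions of this illustration, not the paper's implementation), minmax regret over K models reduces to a small linear program: choose mixture weights w minimizing the largest gap between any group's best achievable utility and the utility it gets under w.

```python
# Minimal LP sketch of MinMax Regret aggregation over K type-specific
# models. U is an assumed utility matrix, not data from the paper.
import numpy as np
from scipy.optimize import linprog

# U[g, m]: utility that preference group g assigns to model m's policy.
U = np.array([[1.0, 0.2, 0.1],
              [0.3, 0.9, 0.2],
              [0.1, 0.3, 0.8]])
G, M = U.shape
best = U.max(axis=1)                       # each group's ideal utility

# Variables x = (w_1..w_M, t). Minimize t subject to
# best_g - U[g] @ w <= t for every group g, with w on the simplex.
c = np.concatenate([np.zeros(M), [1.0]])
A_ub = np.hstack([-U, -np.ones((G, 1))])   # encodes -U @ w - t <= -best
res = linprog(c, A_ub=A_ub, b_ub=-best,
              A_eq=np.concatenate([np.ones(M), [0.0]])[None, :], b_eq=[1.0],
              bounds=[(0, None)] * M + [(None, None)])
w, t = res.x[:M], res.x[M]
print("mixture weights:", w.round(3), "| worst-case regret:", t.round(3))
```

The optimal `t` is exactly the guarantee the claim describes: no group's dissatisfaction (regret) exceeds it, and the LP makes that bound as small as the model ensemble allows.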
@@ -37,6 +37,6 @@ In systems serving diverse populations with irreducible value differences, a sin
 
 **Relevant Notes:**
 - [[pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state]] — MinMax Regret is a technical instantiation of this principle
-- [[standard-pairwise-rlhf-collapses-latent-preference-types-because-single-reward-function-training-cannot-recover-diversity-that-binary-comparisons-encode]] — EM-DPO's EM stage discovers the preference types that MinMax Regret then aggregates
+- [[binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity]] — EM-DPO's EM stage discovers the preference types that MinMax Regret then aggregates
 
 **Topics:** AI alignment, social choice theory, fairness, preference aggregation, egalitarianism
diff --git a/domains/ai-alignment/pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state.md b/domains/ai-alignment/pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state.md
index 4275d94d6..d1ecf7076 100644
--- a/domains/ai-alignment/pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state.md
+++ b/domains/ai-alignment/pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state.md
@@ -1,6 +1,6 @@
 ---
 type: claim
-title: Pluralistic Alignment Must Accommodate Irreducibly Diverse Values Simultaneously Rather Than Converging on a Single Aligned State
+title: Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
 description: Standard alignment procedures reduce distributional pluralism by forcing convergence to a single model, but pluralistic alignment preserves diverse viewpoints through ensemble structures, temporal negotiation, and adaptive policy selection
 confidence: likely
 created: 2026-03-11
@@ -9,7 +9,7 @@ source: "Sorensen et al, Roadmap to Pluralistic Alignment (arXiv 2402.05070, ICM
 enrichments: ["2025-00-00-em-dpo-heterogeneous-preferences-extraction"]
 ---
 
-# Pluralistic Alignment Must Accommodate Irreducibly Diverse Values Simultaneously Rather Than Converging on a Single Aligned State
+# Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
 
 Sorensen et al (ICML 2024, led by Yejin Choi) define three forms of alignment pluralism:
 
@@ -26,7 +26,7 @@ Harland et al (NeurIPS 2024) propose the technical mechanism: Multi-Objective RL
 
 **EM-DPO enrichment (extend)**: The EM-DPO paper provides a concrete implementation of distributional pluralism through latent preference type discovery. Rather than treating preference diversity as noise to average out, EM-DPO uses Expectation-Maximization to identify K distinct preference clusters from binary comparison data, then trains separate models for each type. This operationalizes the principle that diverse values should be accommodated structurally (through model ensembles) rather than collapsed into consensus. The MinMax Regret aggregation strategy then ensures no preference group experiences catastrophic dissatisfaction at deployment time.
 
 **Relevant Notes:**
-- [[standard-pairwise-rlhf-collapses-latent-preference-types-because-single-reward-function-training-cannot-recover-diversity-that-binary-comparisons-encode]] — describes the technical failure mode
+- [[binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity]] — describes the technical failure mode
 - [[egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment]] — MinMax Regret is a technical instantiation of this principle
 - [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — assemblies are one mechanism for pluralistic alignment
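Putting the two stages together, here is a schematic of the EM-DPO alternation described in the enrichment paragraph above. The helper names (`e_step`, `weighted_dpo_loss`), shapes, and inputs are hypothetical; in the paper the log-likelihoods come from K DPO policies and a shared reference model.

```python
# Schematic of the EM-DPO loop: the E-step assigns annotators to latent
# preference types; the M-step fits one DPO objective per type, weighting
# each comparison by its annotator's responsibility. Illustrative only.
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)

def e_step(loglik, pi):
    """loglik[k, i]: log-likelihood of annotator i's comparisons under the
    type-k model (sum over comparisons of log sigmoid(beta * reward margin)).
    Returns resp[k, i] = P(annotator i has type k) given mixture weights pi."""
    log_post = np.log(pi)[:, None] + loglik
    post = np.exp(log_post - log_post.max(axis=0))
    return post / post.sum(axis=0)

def weighted_dpo_loss(margin, resp_k):
    """M-step objective for one type k. margin[j] is beta times the
    policy-vs-reference log-ratio gap (chosen minus rejected) on
    comparison j; resp_k[j] is the responsibility of its annotator."""
    return np.average(-log_sigmoid(margin), weights=resp_k)
```

Training alternates the two steps until responsibilities stabilize, yielding K type-specific policies; the minmax-regret program sketched earlier then handles deployment-time aggregation.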