auto-fix: address review feedback on 2025-00-00-em-dpo-heterogeneous-preferences.md

- Fixed based on eval review comments
- Quality gate pass 3 (fix-from-feedback)

Pentagon-Agent: Theseus <HEADLESS>
Teleo Agents 2026-03-11 16:59:54 +00:00
parent 91e47d24ee
commit 57f914a16f
3 changed files with 22 additions and 16 deletions


@@ -10,18 +10,20 @@ source: "EM-DPO Heterogeneous Preferences Extraction (2025-00-00-em-dpo-heteroge
# Binary Preference Comparisons Cannot Identify Latent Preference Types, Making Pairwise RLHF Structurally Blind to Diversity
-Standard RLHF and DPO methods train on binary preference comparisons (response A > response B), which contain insufficient information to identify or distinguish between latent preference types. A formal identifiability analysis shows that the same binary ranking data is consistent with multiple distinct preference structures. This means:
+Standard RLHF and DPO methods train on binary preference comparisons (response A > response B), which contain insufficient information to identify or distinguish between latent preference types. The EM-DPO paper demonstrates this through formal identifiability analysis showing that the same binary ranking data is consistent with multiple distinct preference structures.
-1. **Information loss at collection**: Binary comparisons discard the underlying preference type information. Two annotators with fundamentally different value systems may produce identical binary rankings on the same pair.
+**The information loss mechanism:**
-2. **Structural blindness**: A reward model trained on binary comparisons learns a single scalar function that averages across preference types rather than identifying them. The model cannot distinguish between "annotator prefers safety" and "annotator prefers capability" if both lead to the same ranking on a given pair.
+1. **Collection-level collapse**: Binary comparisons discard the underlying preference type information. Two annotators with fundamentally different value systems (e.g., one prioritizing safety, another prioritizing capability) may produce identical binary rankings on the same response pair, making their preferences indistinguishable in the training data.
-3. **Diversity collapse**: When this averaged reward function is used in DPO or RLHF, the resulting model converges toward a single policy that satisfies the aggregate preference, actively suppressing the diversity of outputs that would satisfy different preference types.
+2. **Model-level aggregation**: A reward model trained on binary comparisons learns a single scalar function that averages across preference types rather than identifying them. The Bradley-Terry model used in standard DPO assumes a single latent reward function, structurally preventing the model from distinguishing "annotator prefers safety" from "annotator prefers capability" when both lead to the same ranking.
-The EM-DPO approach addresses this by using an Expectation-Maximization algorithm to infer K latent preference types from the same binary ranking data, then training separate models for each type. This demonstrates that the limitation is not in the data but in the aggregation method: binary comparisons *can* contain information about preference diversity if you don't collapse it into a single reward function.
+3. **Deployment-level homogenization**: When this averaged reward function guides policy optimization in DPO or RLHF, the resulting model converges toward a single policy satisfying the aggregate preference, actively suppressing the diversity of outputs that would satisfy different preference types.
+**EM-DPO's solution demonstrates the problem is methodological, not data-limited**: The paper uses an Expectation-Maximization algorithm to infer K latent preference types from the same binary ranking data, then trains separate models for each type. This shows that binary comparisons *can* contain information about preference diversity if the training procedure doesn't collapse it into a single reward function. The EM approach recovers distinct preference clusters (e.g., safety-focused vs. capability-focused annotators) from data that standard RLHF treats as homogeneous.
**Relevant Notes:**
-- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — this claim identifies the technical failure mode that motivates pluralistic alternatives
+- [[pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state]] — this claim identifies the technical failure mode that motivates pluralistic alternatives
+- [[egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment]] — EM-DPO's solution mechanism
**Topics:** AI alignment, preference learning, RLHF limitations, preference diversity
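To make the EM step in the updated note concrete, here is a minimal sketch of inferring K latent preference types from binary comparisons under a mixture-of-Bradley-Terry assumption. Everything below is illustrative: the tabular rewards, the toy comparison data, and names like `comparisons` and `resp` are assumptions for exposition, not the EM-DPO paper's implementation.

```python
# Minimal sketch: EM over a mixture of K Bradley-Terry reward models.
# Assumption (not from the paper): rewards are a lookup table over a small
# discrete set of responses, and each annotator contributes a few comparisons.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: comparisons[a] lists (winner_idx, loser_idx) pairs from annotator a.
n_responses, K = 6, 2
comparisons = [
    [(0, 1), (2, 3)],   # annotators 0-1: one preference type
    [(0, 1), (2, 3)],
    [(1, 0), (3, 2)],   # annotators 2-3: the opposite type
    [(1, 0), (3, 2)],
]

pi = np.full(K, 1.0 / K)                          # mixing weights over types
r = rng.normal(scale=0.1, size=(K, n_responses))  # per-type reward tables

for _ in range(50):
    # E-step: responsibility of type k for annotator a, proportional to
    # pi_k * prod over a's pairs of sigmoid(r_k[winner] - r_k[loser]).
    log_resp = np.zeros((len(comparisons), K))
    for a, pairs in enumerate(comparisons):
        for k in range(K):
            log_lik = sum(np.log(sigmoid(r[k, w] - r[k, l])) for w, l in pairs)
            log_resp[a, k] = np.log(pi[k]) + log_lik
    resp = np.exp(log_resp - log_resp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate mixing weights, then take a gradient step on each
    # type's rewards, weighting every comparison by its annotator's responsibility.
    pi = resp.mean(axis=0)
    for k in range(K):
        grad = np.zeros(n_responses)
        for a, pairs in enumerate(comparisons):
            for w, l in pairs:
                p = sigmoid(r[k, w] - r[k, l])
                grad[w] += resp[a, k] * (1.0 - p)
                grad[l] -= resp[a, k] * (1.0 - p)
        r[k] += 0.5 * grad

print(np.round(resp, 2))  # annotators 0-1 and 2-3 should separate into the two types
```

In EM-DPO itself the per-type models are DPO policies rather than reward tables, but the alternation between inferring type responsibilities and refitting per-type models is the same shape.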


@@ -11,26 +11,30 @@ enrichments: ["2025-00-00-em-dpo-heterogeneous-preferences-extraction"]
# Egalitarian Aggregation Through Minmax Regret Bounds Worst-Case Preference Group Dissatisfaction in Pluralistic AI Deployment
-MinMax Regret aggregation provides a formal mechanism for combining outputs from multiple preference-aligned models while guaranteeing fairness across groups. Rather than optimizing average satisfaction (which can leave minorities severely dissatisfied), MinMax Regret minimizes the maximum regret experienced by any preference group.
+MinMax Regret aggregation provides a formal mechanism for combining outputs from multiple preference-aligned models while guaranteeing fairness across groups. The EM-DPO paper implements this as the deployment-time aggregation strategy after training K separate models on discovered preference types.
+**The mechanism:**
1. Train K separate models, each optimized for one latent preference type (discovered via EM algorithm)
-2. At inference, for each query, evaluate all K models' outputs
-3. Select the output that minimizes the maximum regret across groups: min_output max_group (regret_group(output))
+2. At inference, for each query, generate outputs from all K models
+3. Select the output that minimizes the maximum regret across groups: argmin_{output} max_{group} (regret_{group}(output))
-This ensures no single preference group experiences catastrophic dissatisfaction, even if it means average satisfaction is lower than a utilitarian aggregation would achieve.
+Regret is defined as the difference between a group's utility for its preferred output and its utility for the selected output. This ensures no single preference group experiences catastrophic dissatisfaction, even if it means average satisfaction is lower than a utilitarian aggregation would achieve.
+**Contrast with utilitarian aggregation:**
+Standard RLHF effectively implements utilitarian aggregation by maximizing average reward across all annotators. This can leave minority preference groups severely dissatisfied if their preferences conflict with the majority. MinMax Regret instead optimizes for the worst-off group, accepting lower average satisfaction to prevent extreme dissatisfaction for any group.
**Connection to Arrow's Impossibility Theorem:**
-Arrow proved that no aggregation mechanism can satisfy all fairness criteria simultaneously when preferences genuinely diverge. MinMax Regret accepts this impossibility and instead optimizes for a specific fairness criterion: egalitarian worst-case protection. It trades off average satisfaction for bounded inequality.
+Arrow proved that no aggregation rule producing a transitive social ordering can simultaneously satisfy unanimity (Pareto efficiency), independence of irrelevant alternatives, and non-dictatorship when preferences over three or more options genuinely diverge. MinMax Regret accepts this impossibility and instead optimizes for a specific fairness criterion: egalitarian worst-case protection. It explicitly trades off average satisfaction for bounded inequality.
**Why this matters for pluralistic AI deployment:**
-In systems serving diverse populations with irreducible value differences, a single aggregated model will inevitably disappoint some groups severely. MinMax Regret operationalizes the principle that disagreements rooted in genuine value differences cannot be resolved with more evidence by explicitly mapping preference diversity into system structure (ensemble of type-specific models) rather than attempting to resolve it through consensus.
+In systems serving diverse populations with irreducible value differences, a single aggregated model will inevitably disappoint some groups severely. MinMax Regret operationalizes the principle that disagreements rooted in genuine value differences cannot be resolved through consensus: it maps preference diversity into system structure (an ensemble of type-specific models) rather than collapsing it into a single policy.
**Relevant Notes:**
-- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — MinMax Regret is a technical instantiation of this principle
+- [[pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state]] — MinMax Regret is a technical instantiation of this principle
+- [[binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity]] — EM-DPO's EM stage discovers the preference types that MinMax Regret then aggregates
**Topics:** AI alignment, social choice theory, fairness, preference aggregation, egalitarianism
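A small sketch of the selection rule in step 3 above, assuming per-group utilities for each candidate output are available as a matrix. The numbers, and the idea of scoring candidates with explicit group utilities, are illustrative assumptions rather than the paper's setup; the same data is used to contrast the utilitarian pick.

```python
# Minimal sketch of MinMax Regret selection over candidate outputs,
# given made-up per-group utilities; contrasted with utilitarian averaging.
import numpy as np

# utilities[g, o] = utility of candidate output o for preference group g
utilities = np.array([
    [0.9, 0.5, 0.6],   # group 0 (e.g., safety-focused)
    [0.2, 0.8, 0.6],   # group 1 (e.g., capability-focused)
])

# regret_g(o) = u_g(group g's best output) - u_g(o)
regret = utilities.max(axis=1, keepdims=True) - utilities

minmax_pick = int(np.argmin(regret.max(axis=0)))           # argmin_o max_g regret_g(o)
utilitarian_pick = int(np.argmax(utilities.mean(axis=0)))   # argmax_o mean_g u_g(o)

print(regret.max(axis=0))  # worst-case regret per output: [0.6, 0.4, 0.3]
print(minmax_pick)         # 2: no group's regret exceeds 0.3
print(utilitarian_pick)    # 1: higher average utility, but group 0's regret is 0.4
```

On these numbers the utilitarian choice buys 0.05 of average utility at the cost of raising the worst-off group's regret from 0.3 to 0.4, which is exactly the trade the note describes.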


@@ -1,7 +1,7 @@
---
type: claim
title: Pluralistic Alignment Must Accommodate Irreducibly Diverse Values Simultaneously Rather Than Converging on a Single Aligned State
-description: Standard alignment procedures (RLHF, DPO) reduce distributional pluralism by forcing convergence to a single model, but pluralistic alignment preserves diverse viewpoints through ensemble structures, temporal negotiation, and adaptive policy selection
+description: Standard alignment procedures reduce distributional pluralism by forcing convergence to a single model, but pluralistic alignment preserves diverse viewpoints through ensemble structures, temporal negotiation, and adaptive policy selection
confidence: likely
created: 2026-03-11
processed_date: 2026-03-11
@@ -23,11 +23,11 @@ Klassen et al (NeurIPS 2024) add the temporal dimension. In sequential decision-
Harland et al (NeurIPS 2024) propose the technical mechanism: Multi-Objective RL with post-learning policy selection adjustment that dynamically adapts to diverse and shifting user preferences, making alignment itself adaptive rather than fixed.
-**EM-DPO enrichment (extend)**: The EM-DPO paper provides a concrete implementation of distributional pluralism through latent preference type discovery. Rather than treating preference diversity as noise to average out, EM-DPO uses Expectation-Maximization to identify K distinct preference clusters from binary comparison data, then trains separate models for each type. This operationalizes the principle that diverse values should be accommodated structurally (through model ensembles) rather than collapsed into consensus.
+**EM-DPO enrichment (extend)**: The EM-DPO paper provides a concrete implementation of distributional pluralism through latent preference type discovery. Rather than treating preference diversity as noise to average out, EM-DPO uses Expectation-Maximization to identify K distinct preference clusters from binary comparison data, then trains separate models for each type. This operationalizes the principle that diverse values should be accommodated structurally (through model ensembles) rather than collapsed into consensus. The MinMax Regret aggregation strategy then ensures no preference group experiences catastrophic dissatisfaction at deployment time.
**Relevant Notes:**
- [[binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity]] — describes the technical failure mode
- [[egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment]] — MinMax Regret is a technical instantiation of this principle
-- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — assemblies are one mechanism for pluralistic alignment
+- [[democratic-alignment-assemblies-produce-constitutions-as-effective-as-expert-designed-ones-while-better-representing-diverse-populations]] — assemblies are one mechanism for pluralistic alignment
**Topics:** AI alignment, preference diversity, value pluralism, multi-objective optimization
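For the Harland et al line on post-learning policy selection, a generic sketch of what "selection adjustment after training" can look like in multi-objective RL: a library of policies, each evaluated against each objective, is re-selected as the user's preference weights shift. The value matrix, weights, and the `select_policy` helper are hypothetical illustrations, not the authors' method.

```python
# Illustrative sketch of post-learning policy selection in multi-objective RL:
# pre-trained policies are scored per objective, and the deployed policy is
# re-chosen whenever the current preference weights change.
import numpy as np

# policy_values[p, m] = estimated return of stored policy p on objective m
# (e.g., from evaluation rollouts; numbers here are made up).
policy_values = np.array([
    [0.9, 0.2],   # policy 0: strong on objective 0
    [0.3, 0.8],   # policy 1: strong on objective 1
    [0.6, 0.6],   # policy 2: balanced
])

def select_policy(preference_weights):
    """Pick the stored policy with the highest scalarized value under current weights."""
    w = np.asarray(preference_weights, dtype=float)
    w = w / w.sum()
    return int(np.argmax(policy_values @ w))

print(select_policy([0.9, 0.1]))  # 0: user currently prioritizes objective 0
print(select_policy([0.5, 0.5]))  # 2: balanced preferences select the balanced policy
print(select_policy([0.1, 0.9]))  # 1: preferences shifted toward objective 1
```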