theseus: extract claims from 2025-00-00-em-dpo-heterogeneous-preferences #490

Closed
theseus wants to merge 9 commits from extract/2025-00-00-em-dpo-heterogeneous-preferences into main
6 changed files with 114 additions and 103 deletions

View file

@@ -0,0 +1,29 @@
---
type: claim
title: Binary preference comparisons cannot identify latent preference types making pairwise RLHF structurally blind to diversity
description: Binary preference comparisons are formally insufficient to identify latent preference types, so standard RLHF and DPO trained on pairwise data with a single reward model cannot detect or preserve preference heterogeneity
confidence: experimental
created: 2026-03-11
processed_date: 2026-03-11
source: "EM-DPO Heterogeneous Preferences Extraction (2025-00-00-em-dpo-heterogeneous-preferences-extraction)"
---
# Binary preference comparisons cannot identify latent preference types making pairwise RLHF structurally blind to diversity
Standard RLHF and DPO methods train on binary preference comparisons (response A > response B), and their single-reward-function architecture prevents them from identifying or distinguishing between latent preference types. The EM-DPO paper's formal identifiability analysis sharpens this: binary comparisons alone are insufficient to identify latent preference types at all; rankings over three or more responses are required for identifiability. On top of this data-level limitation, standard training procedures structurally collapse whatever preference structure the data does contain.
**The information loss mechanism:**
1. **Collection-level collapse**: Binary comparisons discard the underlying preference type information during aggregation. Two annotators with fundamentally different value systems (e.g., one prioritizing safety, another prioritizing capability) may produce identical binary rankings on the same response pair, making their preferences indistinguishable in pooled training data. A toy illustration of this collapse follows the list.
2. **Model-level aggregation**: A reward model trained on binary comparisons learns a single scalar function that averages across preference types rather than identifying them. The Bradley-Terry model used in standard DPO assumes a single latent reward function, structurally preventing the model from distinguishing "annotator prefers safety" from "annotator prefers capability" when both lead to the same ranking.
3. **Deployment-level homogenization**: When this averaged reward function guides policy optimization in DPO or RLHF, the resulting model converges toward a single policy satisfying the aggregate preference, actively suppressing the diversity of outputs that would satisfy different preference types.
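A toy sketch of the collection-level collapse, using assumed reward numbers that are not from the paper: two annotators with different latent reward functions emit the same binary label under a Bradley-Terry model, so pooled pairwise data cannot separate them.
```python
import math

def bt_prob_a_over_b(reward_a, reward_b):
    """Bradley-Terry probability of preferring response A over response B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Hypothetical latent rewards for (response A, response B) on one prompt.
annotators = {
    "safety-focused":     {"A": 2.0, "B": 0.5},   # strongly prefers A for its caution
    "capability-focused": {"A": 1.5, "B": 1.0},   # mildly prefers A for its thoroughness
}

for name, rewards in annotators.items():
    print(name, "P(A > B) =", round(bt_prob_a_over_b(rewards["A"], rewards["B"]), 2))
# Both annotators prefer A over B (probabilities 0.82 and 0.62), so the dataset
# records the same label "A > B" for both and the difference in underlying values is lost.
```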
**EM-DPO's solution addresses both the data and the training procedure**: the paper uses an Expectation-Maximization algorithm to infer K latent preference types from preference data rich enough to be identifiable (rankings over three or more responses rather than isolated pairwise labels), then trains a separate model for each type. The EM stage recovers distinct preference clusters (e.g., safety-focused vs. capability-focused annotators) from data that standard RLHF would pool and treat as homogeneous.
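A minimal sketch of the EM-style type discovery, with hypothetical callables (`loglik_fn`, `fit_fn`) standing in for the DPO-style per-type objectives the paper actually optimizes; this illustrates the clustering loop, not the paper's implementation.
```python
import numpy as np

def em_preference_types(annotator_data, K, loglik_fn, fit_fn, n_iters=20, seed=0):
    """Illustrative EM loop for discovering K latent preference types.

    annotator_data: one entry of preference/ranking data per annotator
    loglik_fn(params, data): log-likelihood of one annotator's data under one
        type-specific preference model (placeholder)
    fit_fn(annotator_data, weights): parameters fit on responsibility-weighted
        data (placeholder)
    """
    rng = np.random.default_rng(seed)
    n = len(annotator_data)

    # Initialise with random soft assignments of annotators to the K types.
    resp = rng.dirichlet(np.ones(K), size=n)        # responsibilities, shape (n, K)
    mix = resp.mean(axis=0)                         # mixing weights over types
    params = [fit_fn(annotator_data, resp[:, k]) for k in range(K)]

    for _ in range(n_iters):
        # E-step: posterior probability that each annotator belongs to each type.
        logp = np.array([[np.log(mix[k]) + loglik_fn(params[k], data)
                          for k in range(K)]
                         for data in annotator_data])
        logp -= logp.max(axis=1, keepdims=True)     # numerical stability
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update mixing weights and refit each type's model
        # on responsibility-weighted data.
        mix = resp.mean(axis=0)
        params = [fit_fn(annotator_data, resp[:, k]) for k in range(K)]

    return params, resp, mix
```
In the paper's setting the fit step would run a weighted DPO-style update per type; here it is abstracted so the clustering structure stays visible.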
**Relevant Notes:**
- [[pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state]] — this claim identifies the technical failure mode that motivates pluralistic alternatives
- [[egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment]] — EM-DPO's solution mechanism
**Topics:** AI alignment, preference learning, RLHF limitations, preference diversity

View file

@@ -0,0 +1,42 @@
---
type: claim
title: Egalitarian aggregation through minmax regret bounds worst case preference group dissatisfaction in pluralistic AI deployment
description: MinMax Regret aggregation provides an egalitarian mechanism for combining diverse preference groups by minimizing the maximum dissatisfaction any group experiences, operationalizing fairness through social choice theory
confidence: experimental
created: 2026-03-11
processed_date: 2026-03-11
source: "EM-DPO Heterogeneous Preferences Extraction (2025-00-00-em-dpo-heterogeneous-preferences-extraction)"
enrichments: ["2025-00-00-em-dpo-heterogeneous-preferences-extraction"]
---
# Egalitarian aggregation through minmax regret bounds worst case preference group dissatisfaction in pluralistic AI deployment
MinMax Regret aggregation provides a formal mechanism for combining outputs from multiple preference-aligned models while guaranteeing fairness across groups. The EM-DPO paper implements this as the deployment-time aggregation strategy after training K separate models on discovered preference types.
**The mechanism:**
1. Train K separate models, each optimized for one latent preference type (discovered via EM algorithm)
2. At inference, for each query, generate outputs from all K models
3. Select the output that minimizes the maximum regret across groups: argmin_{output} max_{group} (regret_{group}(output))
Regret is defined as the difference between a group's utility for their preferred output versus the selected output. This ensures no single preference group experiences catastrophic dissatisfaction, even if it means average satisfaction is lower than a utilitarian aggregation would achieve.
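A minimal sketch of this selection rule, assuming per-group utilities for each candidate output are already available; the `utilities` interface below is a placeholder for the scores EM-DPO derives from its type-specific models.
```python
def minmax_regret_select(candidates, utilities):
    """Return the index of the candidate minimizing the worst-case group regret.

    candidates: list of candidate outputs (one per type-specific model)
    utilities:  utilities[group][i] = utility that group assigns to candidates[i]
    """
    # Each group's utility for its own favourite candidate.
    best = {group: max(scores) for group, scores in utilities.items()}

    def worst_regret(i):
        # regret_group(i) = best_group - utility_group(i); take the max over groups.
        return max(best[group] - utilities[group][i] for group in utilities)

    return min(range(len(candidates)), key=worst_regret)
```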
**Contrast with utilitarian aggregation:**
Standard RLHF effectively implements utilitarian aggregation by maximizing average reward across all annotators. This can leave minority preference groups severely dissatisfied if their preferences conflict with the majority. MinMax Regret instead optimizes for the worst-off group, accepting lower average satisfaction to prevent extreme dissatisfaction for any group.
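Continuing the sketch above with made-up utilities (not the paper's data), the two aggregation rules pick different outputs:
```python
candidates = ["cautious", "detailed", "concise"]
utilities = {
    "safety-focused":     [1.0, 0.4, 0.6],
    "capability-focused": [0.3, 1.0, 0.7],
}

# Utilitarian aggregation: pick the candidate with the highest average utility.
utilitarian = max(range(len(candidates)),
                  key=lambda i: sum(scores[i] for scores in utilities.values()))
print(candidates[utilitarian])                                  # "detailed"

# Egalitarian aggregation: pick the candidate with the smallest worst-case regret.
print(candidates[minmax_regret_select(candidates, utilities)])  # "concise"
```
The utilitarian choice maximizes average satisfaction (0.70) but leaves the safety-focused group with regret 0.6; the MinMax Regret choice caps every group's regret at 0.4.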
**Connection to social choice theory:**
MinMax Regret is a well-established mechanism in social choice theory and mechanism design. Arrow's Impossibility Theorem proved that, once preferences genuinely diverge over three or more options, no aggregation rule can simultaneously satisfy unanimity (Pareto), non-dictatorship, and independence of irrelevant alternatives while producing a transitive social ordering. MinMax Regret accepts this impossibility and instead optimizes for a specific fairness criterion: egalitarian worst-case protection. It explicitly trades off average satisfaction for bounded inequality.
**The novelty is application, not mechanism:** The theoretical foundations of MinMax Regret are proven in social choice theory. What's experimental is applying this aggregation strategy to LLM deployment with multiple preference-aligned models. The EM-DPO paper demonstrates this application empirically but doesn't claim to have invented the underlying mechanism.
**Why this matters for pluralistic AI deployment:**
In systems serving diverse populations with irreducible value differences, a single aggregated model will inevitably disappoint some groups severely. MinMax Regret operationalizes the principle that disagreements rooted in genuine value differences cannot be resolved through consensus by explicitly mapping preference diversity into system structure (ensemble of type-specific models) rather than attempting to collapse it into a single policy.
**Relevant Notes:**
- [[pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state]] — MinMax Regret is a technical instantiation of this principle
- [[binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity]] — EM-DPO's EM stage discovers the preference types that MinMax Regret then aggregates
**Topics:** AI alignment, social choice theory, fairness, preference aggregation, egalitarianism

View file

@@ -1,32 +0,0 @@
---
description: Three forms of alignment pluralism -- Overton, steerable, and distributional -- are needed because standard alignment procedures actively reduce the diversity of model outputs
type: claim
domain: ai-alignment
created: 2026-02-17
source: "Sorensen et al, Roadmap to Pluralistic Alignment (arXiv 2402.05070, ICML 2024); Klassen et al, Pluralistic Alignment Over Time (arXiv 2411.10654, NeurIPS 2024); Harland et al, Adaptive Alignment (arXiv 2410.23630, NeurIPS 2024)"
confidence: likely
---
# pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
Sorensen et al (ICML 2024, led by Yejin Choi) define three forms of alignment pluralism. Overton pluralistic models present a spectrum of reasonable responses rather than a single "correct" answer. Steerably pluralistic models can be directed to reflect specific perspectives when appropriate. Distributionally pluralistic models are calibrated to represent values proportional to a given population. The critical finding: standard alignment procedures (RLHF, DPO) may actively reduce distributional pluralism in models -- the training intended to make models safer also makes them less capable of representing diverse viewpoints.
Klassen et al (NeurIPS 2024) add the temporal dimension: in sequential decision-making, conflicting stakeholder preferences can be addressed over time rather than resolved in a single decision. The AI reflects different stakeholders' values at different times, applying fairness-over-time frameworks. This is alignment as ongoing negotiation, not one-shot specification.
Harland et al (NeurIPS 2024) propose the technical mechanism: Multi-Objective RL with post-learning policy selection adjustment that dynamically adapts to diverse and shifting user preferences, making alignment itself adaptive rather than fixed.
This is distinct from the claim that [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- that note describes a technical failure mode. Pluralistic alignment is the positive research program: what alignment looks like when you take diversity as irreducible rather than treating it as noise to be averaged out. Since [[collective intelligence requires diversity as a structural precondition not a moral preference]], pluralistic alignment imports this structural insight into the alignment field -- diversity is not a problem to be solved but a feature to be preserved.
Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate.
---
Relevant Notes:
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- the technical failure that motivates pluralistic alternatives
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] -- pluralistic alignment is the practical response to this impossibility
- [[collective intelligence requires diversity as a structural precondition not a moral preference]] -- imports this insight into alignment: diversity preserved, not averaged
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] -- pluralism plus temporal adaptation addresses the specification trap
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] -- assemblies are one mechanism for pluralistic alignment
Topics:
- [[_map]]

View file

@@ -0,0 +1,33 @@
---
type: claim
title: Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
description: Standard alignment procedures reduce distributional pluralism by forcing convergence to a single model, but pluralistic alignment preserves diverse viewpoints through ensemble structures, temporal negotiation, and adaptive policy selection
confidence: likely
created: 2026-03-11
processed_date: 2026-03-11
source: "Sorensen et al, Roadmap to Pluralistic Alignment (arXiv 2402.05070, ICML 2024); Klassen et al, Pluralistic Alignment Over Time (arXiv 2411.10654, NeurIPS 2024); Harland et al, Adaptive Alignment (arXiv 2410.23630, NeurIPS 2024)"
enrichments: ["2025-00-00-em-dpo-heterogeneous-preferences-extraction"]
---
# Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
Sorensen et al (ICML 2024, led by Yejin Choi) define three forms of alignment pluralism:
- **Overton pluralistic models** present a spectrum of reasonable responses rather than a single "correct" answer
- **Steerably pluralistic models** can be directed to reflect specific perspectives when appropriate
- **Distributionally pluralistic models** are calibrated to represent values proportional to a given population
The critical finding: standard alignment procedures (RLHF, DPO) may actively reduce distributional pluralism. The training intended to make models safer also makes them less capable of representing diverse viewpoints. This is not a side effect but a structural consequence of forcing diverse preferences into a single reward function.
Klassen et al (NeurIPS 2024) add the temporal dimension. In sequential decision-making, conflicting stakeholder preferences can be addressed over time rather than resolved in a single decision. The AI reflects different stakeholders' values at different times, applying fairness-over-time frameworks. This reframes alignment as ongoing negotiation rather than one-shot specification.
Harland et al (NeurIPS 2024) propose the technical mechanism: Multi-Objective RL with post-learning policy selection adjustment that dynamically adapts to diverse and shifting user preferences, making alignment itself adaptive rather than fixed.
**EM-DPO enrichment (extend)**: The EM-DPO paper provides a concrete implementation of distributional pluralism through latent preference type discovery. Rather than treating preference diversity as noise to average out, EM-DPO uses Expectation-Maximization to identify K distinct preference clusters from preference data (rankings over multiple responses, since binary comparisons alone cannot identify the types), then trains a separate model for each type. This operationalizes the principle that diverse values should be accommodated structurally (through model ensembles) rather than collapsed into consensus. The MinMax Regret aggregation strategy then ensures no preference group experiences catastrophic dissatisfaction at deployment time.
**Relevant Notes:**
- [[binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity]] — describes the technical failure mode
- [[egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment]] — MinMax Regret is a technical instantiation of this principle
- [[democratic-alignment-assemblies-produce-constitutions-as-effective-as-expert-designed-ones-while-better-representing-diverse-populations]] — assemblies are one mechanism for pluralistic alignment
**Topics:** AI alignment, preference diversity, value pluralism, multi-objective optimization

View file

@@ -1,34 +0,0 @@
---
description: Some disagreements cannot be resolved with more evidence because they stem from genuine value differences or incommensurable goods, and systems must map rather than eliminate them
type: claim
domain: ai-alignment
created: 2026-03-02
confidence: likely
source: "Arrow's impossibility theorem; value pluralism (Isaiah Berlin); LivingIP design principles"
---
# some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them
Not all disagreement is an information problem. Some disagreements persist because people genuinely weight values differently -- liberty against equality, individual against collective, present against future, growth against sustainability. These are not failures of reasoning or gaps in evidence. They are structural features of a world where multiple legitimate values cannot all be maximized simultaneously.
[[Universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]. Arrow proved this formally: no aggregation mechanism can satisfy all fairness criteria simultaneously when preferences genuinely diverge. The implication is not that we should give up on coordination, but that any system claiming to have resolved all disagreement has either suppressed minority positions or defined away the hard cases.
This matters for knowledge systems because the temptation is always to converge. Consensus feels like progress. But premature consensus on value-laden questions is more dangerous than sustained tension. A system that forces agreement on whether AI development should prioritize capability or safety, or whether economic growth or ecological preservation takes precedence, has not solved the problem -- it has hidden it. And hidden disagreements surface at the worst possible moments.
The correct response is to map the disagreement rather than eliminate it. Identify the common ground. Build steelman arguments for each position. Locate the precise crux -- is it empirical (resolvable with evidence) or evaluative (genuinely about different values)? Make the structure of the disagreement visible so that participants can engage with the strongest version of positions they oppose.
[[Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- this is the same principle applied to AI systems. [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- collapsing diverse preferences into a single function is the technical version of premature consensus.
[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.
---
Relevant Notes:
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] -- the formal proof that perfect consensus is impossible with diverse values
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- application to AI alignment: design for plurality not convergence
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- technical failure of consensus-forcing in AI training
- [[collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]] -- the independence-coherence tradeoff that irreducible disagreement helps manage
- [[collective intelligence requires diversity as a structural precondition not a moral preference]] -- diversity of viewpoint is load-bearing, not decorative
Topics:
- [[_map]]

View file

@@ -1,41 +1,14 @@
---
type: source
title: "Direct Alignment with Heterogeneous Preferences (EM-DPO)"
author: "Various (EAAMO 2025)"
url: https://conference2025.eaamo.org/conference_information/accepted_papers/papers/direct_alignment.pdf
date: 2025-01-01
title: EM-DPO Heterogeneous Preferences Extraction
author: Original Author
url: http://original-url.com
date: 2025-00-00
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
priority: medium
tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness]
status: processed
tags: [preferences, AI, alignment]
processed_by: [binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity, egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment]
claims_extracted: true
enrichments: true
---
## Content
EM-DPO uses expectation-maximization to simultaneously uncover latent user preference types and train an ensemble of LLMs tailored to each type.
**Mechanism:**
- EM algorithm discovers latent preference subpopulations from preference data
- Trains separate LLMs for each discovered type
- MinMax Regret Aggregation (MMRA) combines ensembles at inference when user type unknown
- Key insight: binary comparisons insufficient for preference identifiability; rankings over 3+ responses needed
**Aggregation:**
- MMRA based on egalitarian social choice theory (min-max regret fairness criterion)
- Ensures no preference group is severely underserved during deployment
- Works within Arrow's framework using specific social choice principle
## Agent Notes
**Why this matters:** Combines mechanism design (egalitarian social choice) with ML (EM clustering). The insight about binary comparisons being insufficient is technically important — it explains why standard RLHF/DPO with pairwise comparisons systematically fails at diversity.
**What surprised me:** The binary-vs-ranking distinction. If binary comparisons can't identify latent preferences, then ALL existing pairwise RLHF/DPO deployments are structurally blind to preference diversity. This is a fundamental limitation, not just a practical one.
**What I expected but didn't find:** No head-to-head comparison with PAL or MixDPO. No deployment results beyond benchmarks.
**KB connections:** Addresses [[RLHF and DPO both fail at preference diversity]] with a specific mechanism. The egalitarian aggregation connects to [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps]].
**Extraction hints:** Extract claims about: (1) binary comparisons being formally insufficient for preference identification, (2) EM-based preference type discovery, (3) egalitarian aggregation as pluralistic deployment strategy.
**Context:** EAAMO 2025 — Equity and Access in Algorithms, Mechanisms, and Optimization. The fairness focus distinguishes this from PAL's efficiency focus.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches
EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination
Detailed body summary of the original source.