theseus: extract claims from 2025-00-00-em-dpo-heterogeneous-preferences.md

- Source: inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 5)

Pentagon-Agent: Theseus <HEADLESS>
Teleo Agents 2026-03-11 09:25:43 +00:00
parent db497155d8
commit 65615aa04c
5 changed files with 102 additions and 1 deletion


@@ -0,0 +1,39 @@
---
type: claim
domain: ai-alignment
description: "Binary preference comparisons lack the information-theoretic capacity to identify latent user preference subpopulations; rankings over 3+ responses are required"
confidence: experimental
source: "EM-DPO paper (EAAMO 2025) — formal identifiability analysis"
created: 2025-01-16
---
# Binary preference comparisons cannot identify latent preference types, making pairwise RLHF structurally blind to diversity
The EM-DPO paper presents a formal identifiability analysis demonstrating that binary preference comparisons, the standard data format for RLHF and DPO training, are mathematically insufficient to discover latent user preference subpopulations. Recovering heterogeneous preference types from preference data requires rankings over three or more responses.
## Information-Theoretic Constraint
This is not a practical limitation that better algorithms could overcome; it is a fundamental information-theoretic constraint. Binary comparisons simply do not contain enough information to distinguish between two scenarios:
1. All users share a single preference type whose choices produce the observed pairwise statistics
2. Users belong to genuinely distinct preference types whose mixture produces the same pairwise statistics
The paper's identifiability analysis formalizes this gap: pairwise data cannot resolve the ambiguity, but ranking data over three or more responses can.
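To make the ambiguity concrete, here is a minimal numerical sketch, assuming Plackett-Luce choice models (the specific weights are illustrative, not taken from the paper). It constructs a homogeneous population and a 50/50 mixture of two opposed preference types whose pairwise choice probabilities agree exactly, while their distributions over full rankings of three responses diverge sharply:

```python
from itertools import permutations

def pl_ranking_prob(order, w):
    """Plackett-Luce probability of a full ranking (tuple of item indices)."""
    prob, remaining = 1.0, list(order)
    for item in order:
        prob *= w[item] / sum(w[j] for j in remaining)
        remaining.remove(item)
    return prob

def bt_pairwise_prob(i, j, w):
    """Bradley-Terry probability that item i beats item j."""
    return w[i] / (w[i] + w[j])

def mixture(models, fn, *args):
    """Equal-weight mixture of a quantity over component preference types."""
    return sum(fn(*args, w) for w in models) / len(models)

single = [[1.0, 1.0, 1.0]]                      # one uniform preference type
opposed = [[4.0, 1.0, 0.25], [0.25, 1.0, 4.0]]  # two mirror-image types

# Pairwise marginals are identical for both populations (all exactly 0.5) ...
for i, j in [(0, 1), (1, 2), (0, 2)]:
    print((i, j), mixture(single, bt_pairwise_prob, i, j),
          mixture(opposed, bt_pairwise_prob, i, j))

# ... but ranking distributions over 3 responses separate them clearly:
# the uniform type gives every order 1/6, while the mixture concentrates
# on the two orders that place item 1 in the middle.
for order in permutations(range(3)):
    print(order, round(mixture(single, pl_ranking_prob, order), 3),
          round(mixture(opposed, pl_ranking_prob, order), 3))
```

Under this construction, no amount of pairwise data can tell the two populations apart, while a handful of three-way rankings can.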
## Structural Blindness in Deployed Systems
This means every existing pairwise RLHF/DPO deployment is structurally blind to preference heterogeneity, regardless of model size, training duration, or optimization sophistication. The limitation is not in the training algorithm but in the data format itself.
EM-DPO overcomes this by requiring ranking data during training, which provides sufficient information for the EM algorithm to simultaneously discover preference types and train type-specific models.
## Implications
This finding strengthens the case against standard alignment approaches: the failure to capture preference diversity does not stem merely from the single-reward-function assumption, but from the data format used in nearly all current RLHF/DPO systems.
---
Relevant Notes:
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]
Topics:
- [[domains/ai-alignment/_map]]


@@ -0,0 +1,36 @@
---
type: claim
domain: ai-alignment
description: "MinMax Regret Aggregation uses egalitarian social choice theory to bound worst-case dissatisfaction across preference groups at inference time"
confidence: experimental
source: "EM-DPO paper (EAAMO 2025) — MinMax Regret Aggregation mechanism"
created: 2025-01-16
secondary_domains: [mechanisms]
---
# Egalitarian aggregation through minmax regret bounds worst-case preference group dissatisfaction in pluralistic AI deployment
EM-DPO's MinMax Regret Aggregation (MMRA) mechanism combines outputs from an ensemble of preference-specialized LLMs using an egalitarian fairness criterion from social choice theory. When the user's preference type is unknown at inference time, MMRA selects responses that minimize the maximum regret across all possible preference groups.
## Mechanism
The EM algorithm first discovers K latent preference types from ranking data. The pipeline then trains K separate LLMs, each optimized for one preference type. At deployment, when the user's type is unknown, MMRA aggregates the K model outputs by selecting the response that minimizes worst-case regret: the maximum dissatisfaction any single preference group would experience relative to its preferred response.
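A minimal sketch of the selection rule (the scoring interface and numbers below are hypothetical, not the paper's implementation): given per-group scores for each candidate response, a group's regret for a candidate is the gap to that group's best-scoring candidate, and MMRA picks the candidate whose worst regret across groups is smallest.

```python
import numpy as np

def minmax_regret_select(scores: np.ndarray) -> int:
    """Pick the candidate whose worst-case regret across groups is smallest.

    scores[k, r] = reward that preference group k's model assigns to
    candidate response r; regret is measured against each group's ideal.
    """
    best_per_group = scores.max(axis=1, keepdims=True)  # each group's ideal score
    regret = best_per_group - scores                    # (K, R) regret matrix
    worst_case = regret.max(axis=0)                     # max regret per candidate
    return int(worst_case.argmin())

# Hypothetical scores: 3 preference groups x 4 candidate responses.
scores = np.array([
    [0.9, 0.6, 0.7, 0.2],
    [0.1, 0.5, 0.6, 0.9],
    [0.4, 0.8, 0.7, 0.3],
])
print(minmax_regret_select(scores))  # -> 2: no group's regret exceeds 0.3
```

In this toy instance, candidate 0 would maximize group 0's satisfaction but leave group 1 with regret 0.8; the egalitarian rule gives that up in exchange for bounded dissatisfaction everywhere.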
This implements a specific normative principle: no preference subpopulation should experience severe dissatisfaction, even if that means sacrificing average satisfaction across all groups. The mechanism works within Arrow's impossibility framework by committing to a particular social choice principle (min-max regret) rather than attempting to satisfy all fairness criteria simultaneously.
## Fairness-First Tradeoff
MMRA explicitly trades off average performance for bounded worst-case performance. This prioritizes equity (no group left behind) over efficiency (maximum average satisfaction). The paper does not provide head-to-head comparisons with alternative pluralistic approaches (PAL, MixDPO) or deployment results beyond benchmarks, so the practical performance tradeoffs remain unquantified.
## Connection to Irreducible Disagreement
The mechanism assumes preference differences are permanent features of the deployment context to be accommodated structurally, not temporary conflicts to be eliminated through consensus or better information. This aligns with the principle that some disagreements stem from genuine value differences rather than information gaps.
---
Relevant Notes:
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]
- [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]]
Topics:
- [[domains/ai-alignment/_map]]


@@ -19,6 +19,12 @@ This is distinct from the claim that since [[RLHF and DPO both fail at preferenc
Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate.
### Additional Evidence (confirm)
*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
(confirm) EM-DPO provides a concrete instantiation of simultaneous value accommodation through a three-stage mechanism: (1) EM algorithm discovers K latent preference types from ranking data, (2) trains K separate LLMs each optimized for one type, (3) MinMax Regret Aggregation combines outputs at inference using egalitarian social choice theory. This demonstrates that pluralistic alignment can be operationalized through ensemble structure rather than forcing convergence to a single model or reward function.
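A minimal sketch of the E-step at the heart of stage (1) (hedged: in the paper the M-step retrains K DPO models; here the type-specific models are abstracted to per-user log-likelihood scores, and only the E-step and mixture-weight update are shown):

```python
import numpy as np

def em_step(loglik: np.ndarray, pi: np.ndarray):
    """One EM iteration over K latent preference types.

    loglik[k, n] = log-likelihood of user n's ranking data under the
    current type-k model; pi[k] = current mixture proportion of type k.
    Returns (responsibilities, updated mixture proportions).
    """
    log_post = np.log(pi)[:, None] + loglik           # unnormalized log-posteriors
    log_post -= log_post.max(axis=0, keepdims=True)   # numerical stabilization
    resp = np.exp(log_post)
    resp /= resp.sum(axis=0, keepdims=True)           # resp[k, n] = P(type k | user n)
    return resp, resp.mean(axis=1)

# In the full pipeline the M-step would retrain each of the K type-specific
# models (e.g. DPO on responsibility-weighted ranking data) before re-running
# the E-step, alternating until the type assignments stabilize.
```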
---
Relevant Notes:


@@ -21,6 +21,12 @@ The correct response is to map the disagreement rather than eliminate it. Identi
[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.
### Additional Evidence (confirm)
*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
(confirm) The MinMax Regret Aggregation mechanism explicitly maps preference diversity into system structure (ensemble of type-specific models) rather than attempting to resolve it through consensus or optimization. The egalitarian aggregation criterion (minimize maximum regret across groups) operationalizes the assumption that preference differences are permanent features of the deployment context, not temporary conflicts to be eliminated through better information or algorithmic refinement.
---
Relevant Notes:


@@ -7,9 +7,15 @@ date: 2025-01-01
domain: ai-alignment
secondary_domains: []
format: paper
-status: unprocessed
+status: processed
priority: medium
tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness]
processed_by: theseus
processed_date: 2025-01-16
claims_extracted: ["binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md", "egalitarian-aggregation-through-minmax-regret-ensures-no-preference-group-is-severely-underserved-in-pluralistic-AI-deployment.md"]
enrichments_applied: ["pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md", "some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Extracted two novel claims: (1) formal insufficiency of binary comparisons for preference identification, a fundamental limitation not previously captured in KB; (2) egalitarian aggregation as pluralistic deployment strategy, a specific mechanism design connecting social choice theory to AI alignment. Two enrichments strengthen existing pluralistic alignment claims with concrete technical mechanisms. The binary comparison insufficiency is the most significant contribution: it explains why ALL existing pairwise RLHF/DPO is structurally limited, not just poorly implemented."
---
## Content
@@ -39,3 +45,11 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user preferences
PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches
EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination
## Key Facts
- EM-DPO uses expectation-maximization to discover latent preference types
- MMRA based on egalitarian social choice theory (min-max regret fairness criterion)
- Paper presented at EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization)
- No head-to-head comparison with PAL or MixDPO included in paper
- No deployment results beyond benchmarks reported