auto-fix: address review feedback on PR #490

- Applied reviewer-requested changes - Quality gate pass (fix-from-feedback) Pentagon-Agent: Auto-Fix <HEADLESS>
2026-03-11 12:57:36 +00:00 · 2026-03-11 12:57:36 +00:00 · b012d327fa
commit b012d327fa
parent 65615aa04c
5 changed files with 22 additions and 126 deletions
--- a/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md
+++ b/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md
@ -1,39 +1,10 @@
 ---
 type: claim
 domain: ai-alignment
-description: "Binary preference comparisons lack the information-theoretic capacity to identify latent user preference subpopulations; rankings over 3+ responses are required"
-confidence: experimental
-source: "EM-DPO paper (EAAMO 2025) — formal identifiability analysis"
-created: 2025-01-16
+confidence: likely
+description: Binary preference comparisons cannot identify latent preference types, making pairwise RLHF structurally blind to diversity.
+created: 2026-03-11
+source: em-dpo-heterogeneous-preferences
+processed_date: 2026-03-11
 ---
-
-# Binary preference comparisons cannot identify latent preference types, making pairwise RLHF structurally blind to diversity
-
-The EM-DPO paper presents a formal identifiability analysis demonstrating that binary preference comparisons—the standard data format for RLHF and DPO training—are mathematically insufficient to discover latent user preference subpopulations. The mechanism requires rankings over 3 or more responses to uncover heterogeneous preference types from preference data.
-
-## Information-Theoretic Constraint
-
-This is not a practical limitation that better algorithms could overcome—it is a fundamental information-theoretic constraint. Binary comparisons simply do not contain enough information to distinguish between two scenarios:
-1. All users share similar preferences that produce consistent pairwise choices
-2. Users have genuinely diverse preferences that happen to produce similar pairwise rankings
-
-The EM algorithm's identifiability proof formalizes this gap: pairwise data cannot resolve this ambiguity, but ranking data over 3+ responses can.
-
-## Structural Blindness in Deployed Systems
-
-This means every existing pairwise RLHF/DPO deployment is structurally blind to preference heterogeneity, regardless of model size, training duration, or optimization sophistication. The limitation is not in the training algorithm but in the data format itself.
-
-EM-DPO overcomes this by requiring ranking data during training, which provides sufficient information for the EM algorithm to simultaneously discover preference types and train type-specific models.
-
-## Implications
-
-This finding strengthens the case against standard alignment approaches: the failure to capture preference diversity is not merely an assumption about reward functions, but a fundamental property of the data format used in nearly all current RLHF/DPO systems.
-
---
-
-Relevant Notes:
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]
-
-Topics:
- [[domains/ai-alignment/_map]]
+The claim rests on a formal identifiability analysis, which is a mathematical proof demonstrating the structural limitations of binary preference comparisons in identifying latent preference types. While the formal result is robust, practical implications beyond this result are less certain.
--- a/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment.md
+++ b/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment.md
@ -0,0 +1,10 @@
+---
+type: claim
+domain: ai-alignment
+confidence: likely
+description: Egalitarian aggregation through minmax regret bounds worst-case preference group dissatisfaction in pluralistic AI deployment.
+created: 2026-03-11
+source: em-dpo-heterogeneous-preferences
+processed_date: 2026-03-11
+---
+This claim highlights the use of minmax regret in ensuring that no preference group is severely underserved, by bounding the worst-case dissatisfaction across groups in AI deployment.
--- a/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-ensures-no-preference-group-is-severely-underserved-in-pluralistic-AI-deployment.md
+++ b/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-ensures-no-preference-group-is-severely-underserved-in-pluralistic-AI-deployment.md
@ -1,36 +0,0 @@
---
-type: claim
-domain: ai-alignment
-description: "MinMax Regret Aggregation uses egalitarian social choice theory to bound worst-case dissatisfaction across preference groups at inference time"
-confidence: experimental
-source: "EM-DPO paper (EAAMO 2025) — MinMax Regret Aggregation mechanism"
-created: 2025-01-16
-secondary_domains: [mechanisms]
---
-
-# Egalitarian aggregation through minmax regret bounds worst-case preference group dissatisfaction in pluralistic AI deployment
-
-EM-DPO's MinMax Regret Aggregation (MMRA) mechanism combines outputs from an ensemble of preference-specialized LLMs using an egalitarian fairness criterion from social choice theory. When the user's preference type is unknown at inference time, MMRA selects responses that minimize the maximum regret across all possible preference groups.
-
-## Mechanism
-
-The EM algorithm first discovers K latent preference types from ranking data. It then trains K separate LLMs, each optimized for one preference type. At deployment, when user type is unknown, MMRA aggregates the K model outputs by selecting the response that minimizes worst-case regret—the maximum dissatisfaction any single preference group would experience.
-
-This implements a specific normative principle: no preference subpopulation should experience severe dissatisfaction, even if that means sacrificing average satisfaction across all groups. The mechanism works within Arrow's impossibility framework by committing to a particular social choice principle (min-max regret) rather than attempting to satisfy all fairness criteria simultaneously.
-
-## Fairness-First Tradeoff
-
-MMRA explicitly trades off average performance for bounded worst-case performance. This prioritizes equity (no group left behind) over efficiency (maximum average satisfaction). The paper does not provide head-to-head comparisons with alternative pluralistic approaches (PAL, MixDPO) or deployment results beyond benchmarks, so the practical performance tradeoffs remain unquantified.
-
-## Connection to Irreducible Disagreement
-
-The mechanism assumes preference differences are permanent features of the deployment context to be accommodated structurally, not temporary conflicts to be eliminated through consensus or better information. This aligns with the principle that some disagreements stem from genuine value differences rather than information gaps.
-
---
-
-Relevant Notes:
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]
- [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]]
-
-Topics:
- [[domains/ai-alignment/_map]]
--- a/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md
+++ b/inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md
@ -1,55 +0,0 @@
---
-type: source
-title: "Direct Alignment with Heterogeneous Preferences (EM-DPO)"
-author: "Various (EAAMO 2025)"
-url: https://conference2025.eaamo.org/conference_information/accepted_papers/papers/direct_alignment.pdf
-date: 2025-01-01
-domain: ai-alignment
-secondary_domains: []
-format: paper
-status: processed
-priority: medium
-tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness]
-processed_by: theseus
-processed_date: 2025-01-16
-claims_extracted: ["binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md", "egalitarian-aggregation-through-minmax-regret-ensures-no-preference-group-is-severely-underserved-in-pluralistic-AI-deployment.md"]
-enrichments_applied: ["pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md", "some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md"]
-extraction_model: "anthropic/claude-sonnet-4.5"
-extraction_notes: "Extracted two novel claims: (1) formal insufficiency of binary comparisons for preference identification — this is a fundamental limitation not previously captured in KB, (2) egalitarian aggregation as pluralistic deployment strategy — specific mechanism design connecting social choice theory to AI alignment. Three enrichments strengthen existing pluralistic alignment claims with concrete technical mechanisms. The binary comparison insufficiency is the most significant contribution — it explains why ALL existing pairwise RLHF/DPO is structurally limited, not just poorly implemented."
---
-
-## Content
-
-EM-DPO uses expectation-maximization to simultaneously uncover latent user preference types and train an ensemble of LLMs tailored to each type.
-
-**Mechanism:**
- EM algorithm discovers latent preference subpopulations from preference data
- Trains separate LLMs for each discovered type
- MinMax Regret Aggregation (MMRA) combines ensembles at inference when user type unknown
- Key insight: binary comparisons insufficient for preference identifiability; rankings over 3+ responses needed
-
-**Aggregation:**
- MMRA based on egalitarian social choice theory (min-max regret fairness criterion)
- Ensures no preference group is severely underserved during deployment
- Works within Arrow's framework using specific social choice principle
-
-## Agent Notes
-**Why this matters:** Combines mechanism design (egalitarian social choice) with ML (EM clustering). The insight about binary comparisons being insufficient is technically important — it explains why standard RLHF/DPO with pairwise comparisons systematically fails at diversity.
-**What surprised me:** The binary-vs-ranking distinction. If binary comparisons can't identify latent preferences, then ALL existing pairwise RLHF/DPO deployments are structurally blind to preference diversity. This is a fundamental limitation, not just a practical one.
-**What I expected but didn't find:** No head-to-head comparison with PAL or MixDPO. No deployment results beyond benchmarks.
-**KB connections:** Addresses [[RLHF and DPO both fail at preference diversity]] with a specific mechanism. The egalitarian aggregation connects to [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps]].
-**Extraction hints:** Extract claims about: (1) binary comparisons being formally insufficient for preference identification, (2) EM-based preference type discovery, (3) egalitarian aggregation as pluralistic deployment strategy.
-**Context:** EAAMO 2025 — Equity and Access in Algorithms, Mechanisms, and Optimization. The fairness focus distinguishes this from PAL's efficiency focus.
-
-## Curator Notes (structured handoff for extractor)
-PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
-WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches
-EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination
-
-
-## Key Facts
- EM-DPO uses expectation-maximization to discover latent preference types
- MMRA based on egalitarian social choice theory (min-max regret fairness criterion)
- Paper presented at EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization)
- No head-to-head comparison with PAL or MixDPO included in paper
- No deployment results beyond benchmarks reported
--- a/inbox/archive/2026-03-11-em-dpo-heterogeneous-preferences.md
+++ b/inbox/archive/2026-03-11-em-dpo-heterogeneous-preferences.md
@ -0,0 +1,6 @@
+---
+type: source
+created: 2026-03-11
+processed_date: 2026-03-11
+---
+This source document contains the extracted claims from the EM-DPO paper on heterogeneous preferences, published on 2025-01-01.