| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags | processed_by | processed_date | enrichments_applied | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Direct Alignment with Heterogeneous Preferences (EM-DPO) | Various (EAAMO 2025) | https://conference2025.eaamo.org/conference_information/accepted_papers/papers/direct_alignment.pdf | 2025-01-01 | ai-alignment | | paper | enrichment | medium | | theseus | 2026-03-16 | | anthropic/claude-sonnet-4.5 |
Content
EM-DPO uses expectation-maximization to simultaneously uncover latent user preference types and train an ensemble of LLMs tailored to each type.
Mechanism:
- EM algorithm discovers latent preference subpopulations from preference data
- Trains separate LLMs for each discovered type
- MinMax Regret Aggregation (MMRA) combines ensembles at inference when user type unknown
- Key insight: binary comparisons insufficient for preference identifiability; rankings over 3+ responses needed
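The EM step above can be sketched concretely. The paper's exact parametrization is not reproduced here; this is a minimal sketch assuming each latent type's preferences follow a Plackett-Luce model over full rankings (a common choice for rankings of 3+ responses), with a responsibility-weighted gradient step standing in for the M-step:

```python
import numpy as np

rng = np.random.default_rng(0)

def pl_loglik(u, ranking):
    # Plackett-Luce log-likelihood of a full ranking under utility vector u
    ll = 0.0
    for i in range(len(ranking) - 1):
        suffix = np.asarray(ranking[i:])
        ll += u[ranking[i]] - np.log(np.exp(u[suffix]).sum())
    return ll

def pl_grad(u, ranking):
    # Gradient of the Plackett-Luce log-likelihood w.r.t. u
    g = np.zeros_like(u)
    for i in range(len(ranking) - 1):
        suffix = np.asarray(ranking[i:])
        g[ranking[i]] += 1.0
        p = np.exp(u[suffix])
        g[suffix] -= p / p.sum()
    return g

def em_preference_types(rankings, n_items, n_types=2, iters=50, lr=0.5):
    U = rng.normal(size=(n_types, n_items))  # per-type utility vectors
    pi = np.full(n_types, 1.0 / n_types)     # mixture weights over types
    for _ in range(iters):
        # E-step: responsibility of each latent type for each ranking
        logp = np.array([[np.log(pi[k]) + pl_loglik(U[k], r)
                          for k in range(n_types)] for r in rankings])
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, then take a responsibility-weighted
        # gradient step on each type's utilities
        pi = resp.mean(axis=0)
        for k in range(n_types):
            g = sum(resp[n, k] * pl_grad(U[k], r)
                    for n, r in enumerate(rankings))
            U[k] += lr * g / len(rankings)
            U[k] -= U[k].mean()  # remove translation invariance
    return U, pi, resp
```

On toy data with two subpopulations holding opposite rankings (e.g. half the users rank `[0, 1, 2, 3]`, half rank `[3, 2, 1, 0]`), the responsibilities separate the two groups into distinct latent types. In EM-DPO the per-type model is an LLM trained with DPO on its responsibility-weighted data rather than a raw utility vector; the clustering logic is the same.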
Aggregation:
- MMRA based on egalitarian social choice theory (min-max regret fairness criterion)
- Ensures no preference group is severely underserved during deployment
- Operates within Arrow's social choice framework by committing to one specific aggregation principle (min-max regret)
Agent Notes
Why this matters: Combines mechanism design (egalitarian social choice) with ML (EM clustering). The insight about binary comparisons being insufficient is technically important: it explains why standard RLHF/DPO with pairwise comparisons systematically fails at diversity.

What surprised me: The binary-vs-ranking distinction. If binary comparisons can't identify latent preferences, then ALL existing pairwise RLHF/DPO deployments are structurally blind to preference diversity. This is a fundamental limitation, not just a practical one.

What I expected but didn't find: No head-to-head comparison with PAL or MixDPO. No deployment results beyond benchmarks.

KB connections: Addresses "RLHF and DPO both fail at preference diversity" with a specific mechanism. The egalitarian aggregation connects to "some disagreements are permanently irreducible because they stem from genuine value differences, not information gaps".

Extraction hints: Extract claims about (1) binary comparisons being formally insufficient for preference identification, (2) EM-based preference type discovery, (3) egalitarian aggregation as a pluralistic deployment strategy.

Context: EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization). The fairness focus distinguishes this from PAL's efficiency focus.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values"

WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches

EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination
Key Facts
- EM-DPO presented at EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization)
- EM-DPO uses rankings over 3+ responses rather than binary comparisons for preference data
- MinMax Regret Aggregation is based on egalitarian social choice theory
- The paper focuses on fairness rather than efficiency, distinguishing it from PAL's approach