teleo-codex/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md
Teleo Agents 65615aa04c theseus: extract claims from 2025-00-00-em-dpo-heterogeneous-preferences.md
- Source: inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 5)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-11 09:25:43 +00:00


| type | domain | description | confidence | source | created |
|------|--------|-------------|------------|--------|---------|
| claim | ai-alignment | Binary preference comparisons lack the information-theoretic capacity to identify latent user preference subpopulations; rankings over 3+ responses are required | experimental | EM-DPO paper (EAAMO 2025), formal identifiability analysis | 2025-01-16 |

Binary preference comparisons cannot identify latent preference types, making pairwise RLHF structurally blind to diversity

The EM-DPO paper presents a formal identifiability analysis demonstrating that binary preference comparisons, the standard data format for RLHF and DPO training, are mathematically insufficient to discover latent user preference subpopulations. Recovering heterogeneous preference types from preference data instead requires rankings over three or more responses.

Information-Theoretic Constraint

This is not a practical limitation that better algorithms could overcome—it is a fundamental information-theoretic constraint. Binary comparisons simply do not contain enough information to distinguish between two scenarios:

  1. A homogeneous population whose shared preferences produce the observed pairwise choice probabilities
  2. A heterogeneous population whose distinct preference types average out to produce the same pairwise choice probabilities

The paper's identifiability proof formalizes this gap: pairwise data cannot resolve the ambiguity, but ranking data over 3+ responses can.
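The ambiguity can be made concrete with a small worked example (the numbers here are our own illustration, not from the paper). Take three responses and two candidate populations: one homogeneous type that is indifferent between all three, and a 50/50 mixture of two opposed types. Under Bradley-Terry pairwise choice and Plackett-Luce rankings, every pairwise probability is identical across the two populations, yet their distributions over full 3-way rankings differ sharply:

```python
# Illustrative sketch: two populations indistinguishable from pairwise
# comparisons alone, but separable from rankings over 3 responses.
import math
from itertools import permutations

def bt_pair_prob(r, a, b):
    """Bradley-Terry probability that response a beats response b."""
    return 1.0 / (1.0 + math.exp(-(r[a] - r[b])))

def pl_rank_prob(r, ranking):
    """Plackett-Luce probability of a full ranking given rewards r."""
    w = [math.exp(x) for x in r]
    p, remaining = 1.0, list(ranking)
    for i in ranking:
        p *= w[i] / sum(w[j] for j in remaining)
        remaining.remove(i)
    return p

# Population A: one type, indifferent between all three responses.
homogeneous = [(1.0, [0.0, 0.0, 0.0])]
# Population B: 50/50 mix of two opposed preference types.
heterogeneous = [(0.5, [1.0, 0.0, -1.0]), (0.5, [-1.0, 0.0, 1.0])]

def mix_pair(pop, a, b):
    return sum(pi * bt_pair_prob(r, a, b) for pi, r in pop)

def mix_rank(pop, ranking):
    return sum(pi * pl_rank_prob(r, ranking) for pi, r in pop)

# Every pairwise probability is exactly 0.5 in both populations
# (sigma(x) + sigma(-x) = 1, so the opposed types cancel)...
for a, b in [(0, 1), (0, 2), (1, 2)]:
    print(a, b, mix_pair(homogeneous, a, b), mix_pair(heterogeneous, a, b))

# ...but the distributions over 3-way rankings differ: the homogeneous
# population is uniform (1/6 each), the mixture concentrates on the
# rankings that either type strongly prefers.
for perm in permutations(range(3)):
    print(perm, round(mix_rank(homogeneous, perm), 3),
          round(mix_rank(heterogeneous, perm), 3))
```

No amount of pairwise data distinguishes these two populations, which is exactly the scenario the identifiability proof rules in; the ranking distributions separate them immediately.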

Structural Blindness in Deployed Systems

This means every existing pairwise RLHF/DPO deployment is structurally blind to preference heterogeneity, regardless of model size, training duration, or optimization sophistication. The limitation is not in the training algorithm but in the data format itself.

EM-DPO overcomes this by requiring ranking data during training, which provides sufficient information for the EM algorithm to simultaneously discover preference types and train type-specific models.
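As a reduced illustration of this idea (our own construction, not the paper's algorithm, which trains type-specific DPO policies rather than explicit reward vectors), here is EM on a mixture of Plackett-Luce ranking models: the E-step infers each user's latent type from their rankings, and the M-step refits the mixing weights and per-type rewards:

```python
# Toy EM for a mixture of Plackett-Luce ranking models over 3 responses.
import math
import random

N_ITEMS, N_TYPES, RANKS_PER_USER = 3, 2, 20

def pl_logprob(r, ranking):
    """Log-probability of a full ranking under Plackett-Luce rewards r."""
    lp, remaining = 0.0, list(ranking)
    for i in ranking:
        lp += r[i] - math.log(sum(math.exp(r[j]) for j in remaining))
        remaining.remove(i)
    return lp

def sample_ranking(r, rng):
    items, out = list(range(N_ITEMS)), []
    while items:
        weights = [math.exp(r[i]) for i in items]
        out.append(items.pop(rng.choices(range(len(items)), weights=weights)[0]))
    return tuple(out)

rng = random.Random(0)
true_r = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]  # two opposed preference types
users = [(u % 2, [sample_ranking(true_r[u % 2], rng)
                  for _ in range(RANKS_PER_USER)]) for u in range(60)]

pi = [0.5, 0.5]
# Distinct starting points (in practice one would use random restarts).
r = [[0.5 + rng.gauss(0, 0.01), 0.0, -0.5],
     [-0.5, 0.0, 0.5 + rng.gauss(0, 0.01)]]

def marginal_loglik():
    total = 0.0
    for _, ranks in users:
        logs = [math.log(pi[k]) + sum(pl_logprob(r[k], rk) for rk in ranks)
                for k in range(N_TYPES)]
        m = max(logs)
        total += m + math.log(sum(math.exp(l - m) for l in logs))
    return total

ll_start = marginal_loglik()
for _ in range(30):
    # E-step: posterior responsibility of each type for each user.
    gammas = []
    for _, ranks in users:
        logs = [math.log(pi[k]) + sum(pl_logprob(r[k], rk) for rk in ranks)
                for k in range(N_TYPES)]
        m = max(logs)
        p = [math.exp(l - m) for l in logs]
        s = sum(p)
        gammas.append([x / s for x in p])
    # M-step: exact update for pi; one normalized gradient-ascent step on
    # each type's rewards (a small step keeps the likelihood increasing).
    pi = [sum(g[k] for g in gammas) / len(users) for k in range(N_TYPES)]
    for k in range(N_TYPES):
        grad, weight = [0.0] * N_ITEMS, 1e-9
        for (_, ranks), g in zip(users, gammas):
            weight += g[k] * len(ranks)
            for ranking in ranks:
                remaining = list(ranking)
                for i in ranking:
                    z = sum(math.exp(r[k][j]) for j in remaining)
                    for j in remaining:
                        grad[j] += g[k] * ((j == i) - math.exp(r[k][j]) / z)
                    remaining.remove(i)
        r[k] = [r[k][i] + 0.2 * grad[i] / weight for i in range(N_ITEMS)]

ll_end = marginal_loglik()
# Hard-assign users to their most probable type; accuracy up to label swap.
acc = sum((g[0] > g[1]) == (t == 0) for (t, _), g in zip(users, gammas)) / len(users)
acc = max(acc, 1 - acc)
print(ll_start, ll_end, acc)
```

With ranking data the posterior responsibilities cleanly separate the two latent types; the same EM machinery run on pairwise data from the worked example above would have nothing to latch onto, since both populations induce identical pairwise likelihoods.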

Implications

This finding strengthens the case against standard alignment approaches: the failure to capture preference diversity is not merely an assumption about reward functions, but a fundamental property of the data format used in nearly all current RLHF/DPO systems.


Relevant Notes:

Topics: