---
type: claim
domain: ai-alignment
description: "Binary preference comparisons lack the information-theoretic capacity to identify latent user preference subpopulations; rankings over 3+ responses are required"
confidence: experimental
source: "EM-DPO paper (EAAMO 2025) — formal identifiability analysis"
created: 2025-01-16
---

# Binary preference comparisons cannot identify latent preference types, making pairwise RLHF structurally blind to diversity

The EM-DPO paper presents a formal identifiability analysis demonstrating that binary preference comparisons, the standard data format for RLHF and DPO training, are mathematically insufficient to recover latent user preference subpopulations. Uncovering heterogeneous preference types from preference data requires rankings over 3 or more responses.

## Information-Theoretic Constraint

This is not a practical limitation that better algorithms could overcome; it is a fundamental information-theoretic constraint. Binary comparisons simply do not contain enough information to distinguish between two scenarios:

1. All users share similar preferences that produce consistent pairwise choices
2. Users hold genuinely diverse preferences that nonetheless aggregate to the same pairwise choice probabilities

The paper's identifiability proof formalizes this gap: pairwise data cannot resolve the ambiguity, but ranking data over 3+ responses can. A concrete numerical instance of the ambiguity is sketched in the appendix below.

## Structural Blindness in Deployed Systems

This means every existing pairwise RLHF/DPO deployment is structurally blind to preference heterogeneity, regardless of model size, training duration, or optimization sophistication. The limitation lies not in the training algorithm but in the data format itself.

EM-DPO overcomes this by requiring ranking data during training, which carries enough information for the EM algorithm to simultaneously discover latent preference types and train type-specific models (a toy version of this loop also appears in the appendix).

## Implications

This finding strengthens the case against standard alignment approaches: the failure to capture preference diversity is not merely a consequence of assuming a single reward function; it is a property of the data format used in nearly all current RLHF/DPO systems.

---

Relevant Notes:
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]

Topics:
- [[domains/ai-alignment/_map]]
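
---

## Appendix: Worked Examples

To make the pairwise ambiguity concrete, here is a minimal numerical sketch. It is my own illustration, not a construction from the paper, and it assumes the Bradley-Terry model for pairwise choices (the model DPO is derived from) and the Plackett-Luce model for rankings. It compares a homogeneous population that is indifferent among three responses against a 50/50 mixture of two polarized types: the two populations are indistinguishable from binary comparisons but separate cleanly on full rankings.

```python
import math
from itertools import permutations

def bt_pair_prob(u, i, j):
    """Bradley-Terry probability that response i is chosen over response j."""
    return 1.0 / (1.0 + math.exp(-(u[i] - u[j])))

def pl_rank_prob(u, ranking):
    """Plackett-Luce probability of a full ranking (best response first)."""
    prob, remaining = 1.0, list(ranking)
    for item in ranking[:-1]:
        prob *= math.exp(u[item]) / sum(math.exp(u[r]) for r in remaining)
        remaining.remove(item)
    return prob

# Population A: one homogeneous type, indifferent among responses 0, 1, 2.
pop_a = [(1.0, [0.0, 0.0, 0.0])]
# Population B: two polarized types with opposite preferences, mixed 50/50.
pop_b = [(0.5, [2.0, 1.0, 0.0]), (0.5, [0.0, 1.0, 2.0])]

def mix_pair(pop, i, j):
    return sum(w * bt_pair_prob(u, i, j) for w, u in pop)

def mix_rank(pop, ranking):
    return sum(w * pl_rank_prob(u, ranking) for w, u in pop)

# Pairwise choice probabilities coincide (all exactly 0.5), so binary
# comparisons cannot distinguish the two populations...
for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(f"P({i} > {j}):  A={mix_pair(pop_a, i, j):.4f}  B={mix_pair(pop_b, i, j):.4f}")

# ...but the distributions over full rankings of the 3 responses differ.
for r in permutations(range(3)):
    print(f"P{r}:  A={mix_rank(pop_a, r):.4f}  B={mix_rank(pop_b, r):.4f}")
```

Running this prints identical pairwise columns for both populations (every entry 0.5), while the ranking columns differ: the polarized mixture places roughly 0.255 on each of its two extreme orderings versus the uniform 1/6 ≈ 0.167. That gap is exactly what rankings over 3+ responses expose and pairwise data cannot.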
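
The second sketch shows the mechanism the note attributes to EM-DPO, in deliberately simplified form: given full rankings, alternate between inferring which latent type most plausibly produced each ranking (E-step) and refitting the per-type parameters and mixture weights (M-step). EM-DPO fits type-specific policies with a DPO-style objective; here each "type" is just an explicit Plackett-Luce utility vector, and the helper names (`pl_log_prob`, `pl_grad`, `em_fit`) are hypothetical, not the paper's API.

```python
import numpy as np

def pl_log_prob(u, ranking):
    """Log Plackett-Luce probability of a ranking (best item first) under utilities u."""
    lp, remaining = 0.0, list(ranking)
    for item in ranking[:-1]:
        lp += u[item] - np.logaddexp.reduce(np.array([u[r] for r in remaining]))
        remaining.remove(item)
    return lp

def pl_grad(u, ranking):
    """Gradient of the Plackett-Luce log-probability with respect to u."""
    g, remaining = np.zeros_like(u), list(ranking)
    for item in ranking[:-1]:
        logits = np.array([u[r] for r in remaining])
        soft = np.exp(logits - np.logaddexp.reduce(logits))
        g[item] += 1.0
        for r, s in zip(remaining, soft):
            g[r] -= s  # softmax mass pulls every still-remaining item down
        remaining.remove(item)
    return g

def em_fit(rankings, n_types, n_items, n_iter=100, lr=0.5, seed=0):
    """Toy EM for a mixture of Plackett-Luce rankers over item indices."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_types, n_items))  # per-type utilities
    pi = np.full(n_types, 1.0 / n_types)                # mixture weights
    for _ in range(n_iter):
        # E-step: posterior responsibility of each latent type for each ranking.
        log_joint = np.array([[np.log(pi[k]) + pl_log_prob(U[k], r)
                               for k in range(n_types)] for r in rankings])
        resp = np.exp(log_joint
                      - np.logaddexp.reduce(log_joint, axis=1, keepdims=True))
        # M-step: closed-form mixture weights, then one responsibility-weighted
        # gradient step on each type's utilities (a stand-in for EM-DPO's
        # type-conditional policy update).
        pi = resp.mean(axis=0)
        for k in range(n_types):
            step = sum(w * pl_grad(U[k], r) for w, r in zip(resp[:, k], rankings))
            U[k] += lr * step / len(rankings)
    return pi, U
```

Fed rankings sampled from the polarized mixture in the first sketch, this loop can separate the two opposing utility orderings. Fed only pairwise comparisons, distinct mixtures achieve the same likelihood (as the first sketch demonstrates), so no amount of EM iteration can single out the true population structure; that is the identifiability failure in algorithmic form.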