teleo-codex/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md

---
type: claim
title: Binary Preference Comparisons Cannot Identify Latent Preference Types, Making Pairwise RLHF Structurally Blind to Diversity
description: Binary preference comparisons lack the information structure to identify latent preference types, making standard pairwise RLHF and DPO methods incapable of detecting or preserving preference diversity
confidence: experimental
created: 2026-03-11
processed_date: 2026-03-11
source: EM-DPO Heterogeneous Preferences Extraction (2025-00-00-em-dpo-heterogeneous-preferences-extraction)
---

# Binary Preference Comparisons Cannot Identify Latent Preference Types, Making Pairwise RLHF Structurally Blind to Diversity

Standard RLHF and DPO methods train on binary preference comparisons (response A > response B), which contain insufficient information to identify or distinguish between latent preference types. The EM-DPO paper demonstrates this through formal identifiability analysis showing that the same binary ranking data is consistent with multiple distinct preference structures.
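A minimal numeric sketch of this non-identifiability (the reward values and the 50/50 mixture are hypothetical illustrations, not figures from the paper): two opposed annotator types produce exactly the same aggregate comparison probability as a single indifferent annotator, so the binary data cannot tell the two worlds apart.

```python
import math

def bt_prob(r_a, r_b):
    """Bradley-Terry probability that response A is preferred to B."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

# Two hypothetical annotator types with opposite values.
# safety_type strongly prefers A; capability_type strongly prefers B.
p_mixture = 0.5 * bt_prob(2.0, -2.0) + 0.5 * bt_prob(-2.0, 2.0)

# A single annotator who is indifferent (equal rewards).
p_single = bt_prob(0.0, 0.0)

# Both yield P(A > B) = 0.5: the datasets are statistically identical.
print(p_mixture, p_single)
```

Any pairwise dataset drawn from the mixture is indistinguishable from one drawn from the indifferent annotator, which is the identifiability failure the claim describes.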

The information loss operates at three levels:

  1. Collection-level collapse: Binary comparisons discard information about the underlying preference type. Two annotators with fundamentally different value systems (e.g., one prioritizing safety, the other capability) can produce identical binary rankings on the same response pair, making their preferences indistinguishable in the training data.

  2. Model-level aggregation: A reward model trained on binary comparisons learns a single scalar function that averages across preference types rather than identifying them. The Bradley-Terry model used in standard DPO assumes a single latent reward function, structurally preventing the model from distinguishing "annotator prefers safety" from "annotator prefers capability" when both lead to the same ranking.

  3. Deployment-level homogenization: When this averaged reward function guides policy optimization in DPO or RLHF, the resulting model converges toward a single policy satisfying the aggregate preference, actively suppressing the diversity of outputs that would satisfy different preference types.
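The model-level averaging in steps 2 and 3 can be sketched by fitting a single Bradley-Terry reward gap to comparisons sampled from two opposed annotator types (all reward values and sample sizes are hypothetical):

```python
import math
import random

random.seed(0)

def sample_preference(r_a, r_b):
    """Sample one binary comparison from a Bradley-Terry annotator."""
    p = 1.0 / (1.0 + math.exp(-(r_a - r_b)))
    return 1 if random.random() < p else 0   # 1 means "A > B"

# Mixed dataset: half safety-type (prefers A), half capability-type (prefers B).
data = [sample_preference(2.0, -2.0) for _ in range(5000)] + \
       [sample_preference(-2.0, 2.0) for _ in range(5000)]

# For a single response pair, the MLE of one Bradley-Terry reward gap
# is just the empirical log-odds of the pooled data.
freq = sum(data) / len(data)
reward_gap = math.log(freq / (1 - freq))

print(round(reward_gap, 2))  # ≈ 0: the fitted scalar reward is indifferent
```

The fitted single reward is near zero even though no individual annotator is indifferent: optimizing a policy against it satisfies the aggregate while representing neither type.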

EM-DPO's solution demonstrates that the problem is methodological, not data-limited: the paper uses an Expectation-Maximization algorithm to infer K latent preference types from the same binary ranking data, then trains a separate model for each type. This shows that collections of binary comparisons can carry information about preference diversity, provided the training procedure does not collapse them into a single reward function. The EM approach recovers distinct preference clusters (e.g., safety-focused vs. capability-focused annotators) from data that standard RLHF treats as homogeneous.
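A toy version of the EM idea, clustering annotators into two latent Bradley-Terry types from their binary labels. This is a sketch of the approach under simplifying assumptions (one response pair, a fixed number of labels per annotator, hypothetical reward gaps of ±2), not the paper's implementation:

```python
import math
import random

random.seed(1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Synthetic annotators: half safety-type (reward gap +2, prefers A),
# half capability-type (reward gap -2). Each labels the same pair m times.
m = 20
def annotator(delta):
    p = sigmoid(delta)
    return sum(random.random() < p for _ in range(m))  # count of "A > B" labels

counts = [annotator(2.0) for _ in range(50)] + [annotator(-2.0) for _ in range(50)]

# EM for a 2-component mixture of Bradley-Terry annotator types.
pi = [0.5, 0.5]   # mixing weights
p = [0.6, 0.4]    # per-type P(A > B), asymmetric init to break symmetry
for _ in range(100):
    # E-step: responsibility of each type for each annotator's label counts.
    resp = []
    for c in counts:
        lik = [pi[k] * p[k]**c * (1 - p[k])**(m - c) for k in range(2)]
        z = sum(lik)
        resp.append([l / z for l in lik])
    # M-step: re-estimate mixing weights and per-type preference probabilities.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        pi[k] = nk / len(counts)
        p[k] = sum(r[k] * c for r, c in zip(resp, counts)) / (nk * m)

# Recovered per-type reward gaps via the log-odds of each component.
deltas = [math.log(pk / (1 - pk)) for pk in p]
print([round(d, 1) for d in deltas])  # ≈ [+2, -2] in some order
```

The same binary data that a single Bradley-Terry fit would average to indifference yields two well-separated preference clusters once the latent type is modeled, mirroring the paper's core point.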

Relevant Notes:

Topics: AI alignment, preference learning, RLHF limitations, preference diversity