diff --git a/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md b/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md
index 2afeffbd1..9f729eb2d 100644
--- a/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md
+++ b/domains/ai-alignment/binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity.md
@@ -1,20 +1,20 @@
 ---
 type: claim
-title: Binary Preference Comparisons Cannot Identify Latent Preference Types, Making Pairwise RLHF Structurally Blind to Diversity
-description: Binary preference comparisons lack the information structure to identify latent preference types, making standard pairwise RLHF and DPO methods incapable of detecting or preserving preference diversity
+title: Standard Pairwise RLHF Collapses Latent Preference Types Because Single-Reward-Function Training Cannot Recover Diversity That Binary Comparisons Encode
+description: Binary preference comparisons contain information about preference diversity, but standard RLHF and DPO methods using single reward models structurally collapse this information, making them incapable of detecting or preserving preference heterogeneity
 confidence: experimental
 created: 2026-03-11
 processed_date: 2026-03-11
 source: "EM-DPO Heterogeneous Preferences Extraction (2025-00-00-em-dpo-heterogeneous-preferences-extraction)"
 ---
 
-# Binary Preference Comparisons Cannot Identify Latent Preference Types, Making Pairwise RLHF Structurally Blind to Diversity
+# Standard Pairwise RLHF Collapses Latent Preference Types Because Single-Reward-Function Training Cannot Recover Diversity That Binary Comparisons Encode
 
-Standard RLHF and DPO methods train on binary preference comparisons (response A > response B), which contain insufficient information to identify or distinguish between latent preference types. The EM-DPO paper demonstrates this through formal identifiability analysis showing that the same binary ranking data is consistent with multiple distinct preference structures.
+Standard RLHF and DPO methods train on binary preference comparisons (response A > response B), but their single-reward-function architecture prevents them from identifying or distinguishing between latent preference types. The EM-DPO paper demonstrates through formal identifiability analysis that binary ranking data contains sufficient information to recover preference diversity, but standard training procedures structurally collapse it.
 
 **The information loss mechanism:**
 
-1. **Collection-level collapse**: Binary comparisons discard the underlying preference type information. Two annotators with fundamentally different value systems (e.g., one prioritizing safety, another prioritizing capability) may produce identical binary rankings on the same response pair, making their preferences indistinguishable in the training data.
+1. **Collection-level collapse**: Binary comparisons discard the underlying preference type information during aggregation. Two annotators with fundamentally different value systems (e.g., one prioritizing safety, another prioritizing capability) may produce identical binary rankings on the same response pair, making their preferences indistinguishable in pooled training data.
 
 2. **Model-level aggregation**: A reward model trained on binary comparisons learns a single scalar function that averages across preference types rather than identifying them. The Bradley-Terry model used in standard DPO assumes a single latent reward function, structurally preventing the model from distinguishing "annotator prefers safety" from "annotator prefers capability" when both lead to the same ranking.
 
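The collapse mechanism described in the note above lends itself to a small worked example. The sketch below is illustrative only: the response features, the two preference types, and their weights are invented, not taken from the EM-DPO paper. It shows two Bradley-Terry annotator types with opposed priorities agreeing on one response pair (so their labels are indistinguishable when pooled) and splitting on another pair, where pooling their labels leaves only a near-indifferent average for a single reward function to fit.

```python
import numpy as np

# Hypothetical per-response feature scores: [safety, capability]
responses = {
    "A": np.array([0.9, 0.2]),
    "B": np.array([0.3, 0.1]),
    "C": np.array([0.1, 0.95]),
}

# Two latent preference types with opposite priorities (illustrative weights).
type_weights = {
    "safety-first":     np.array([4.0, 0.4]),
    "capability-first": np.array([0.4, 4.0]),
}

def bt_prob(w, x, y):
    """Bradley-Terry probability that x is preferred to y when reward = w . features."""
    rx, ry = w @ responses[x], w @ responses[y]
    return 1.0 / (1.0 + np.exp(-(rx - ry)))

# On the pair (A, B) both types lean toward A, so their binary labels look the same
# and the pooled dataset cannot tell the two value systems apart on this pair.
for name, w in type_weights.items():
    print(f"{name}: P(A > B) = {bt_prob(w, 'A', 'B'):.2f}")

# On the pair (A, C) the types sharply disagree; pooling their labels yields a
# roughly 50/50 mix that a single scalar reward can only explain as indifference.
pooled = np.mean([bt_prob(w, "A", "C") for w in type_weights.values()])
print(f"pooled P(A > C) = {pooled:.2f}")
```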
diff --git a/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment.md b/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment.md
index 78cdd6a78..d3657b1a3 100644
--- a/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment.md
+++ b/domains/ai-alignment/egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment.md
@@ -25,9 +25,11 @@ Regret is defined as the difference between a group's utility for their preferre
 Standard RLHF effectively implements utilitarian aggregation by maximizing average reward across all annotators. This can leave minority preference groups severely dissatisfied if their preferences conflict with the majority. MinMax Regret instead optimizes for the worst-off group, accepting lower average satisfaction to prevent extreme dissatisfaction for any group.
 
-**Connection to Arrow's Impossibility Theorem:**
+**Connection to social choice theory:**
 
-Arrow proved that no aggregation mechanism can satisfy all fairness criteria simultaneously (unanimity, non-dictatorship, independence of irrelevant alternatives, transitivity) when preferences genuinely diverge. MinMax Regret accepts this impossibility and instead optimizes for a specific fairness criterion: egalitarian worst-case protection. It explicitly trades off average satisfaction for bounded inequality.
+MinMax Regret is a well-established mechanism in social choice theory and mechanism design. Arrow's Impossibility Theorem proved that no aggregation mechanism can satisfy all fairness criteria simultaneously (unanimity, non-dictatorship, independence of irrelevant alternatives, transitivity) when preferences genuinely diverge. MinMax Regret accepts this impossibility and instead optimizes for a specific fairness criterion: egalitarian worst-case protection. It explicitly trades off average satisfaction for bounded inequality.
+
+**The novelty is application, not mechanism:** The theoretical foundations of MinMax Regret are proven in social choice theory. What's experimental is applying this aggregation strategy to LLM deployment with multiple preference-aligned models. The EM-DPO paper demonstrates this application empirically but doesn't claim to have invented the underlying mechanism.
 
 **Why this matters for pluralistic AI deployment:**
@@ -35,6 +37,6 @@ In systems serving diverse populations with irreducible value differences, a sin
 **Relevant Notes:**
 - [[pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state]] — MinMax Regret is a technical instantiation of this principle
-- [[binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity]] — EM-DPO's EM stage discovers the preference types that MinMax Regret then aggregates
+- [[standard-pairwise-rlhf-collapses-latent-preference-types-because-single-reward-function-training-cannot-recover-diversity-that-binary-comparisons-encode]] — EM-DPO's EM stage discovers the preference types that MinMax Regret then aggregates
 
 **Topics:** AI alignment, social choice theory, fairness, preference aggregation, egalitarianism
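As a minimal sketch of the aggregation rule this note describes, the toy example below contrasts utilitarian selection with MinMax Regret selection over a hypothetical utility matrix (preference groups by candidate policies). All numbers are invented for illustration; the point is only that the mean-maximizing choice can carry a large worst-case regret while the MinMax Regret choice bounds it.

```python
import numpy as np

# utilities[g, p]: utility of preference group g under candidate policy p
# (rows: two majority groups and one minority group; columns: three policies).
utilities = np.array([
    [1.00, 0.55, 0.20],   # majority group 1
    [1.00, 0.55, 0.20],   # majority group 2
    [0.00, 0.80, 1.00],   # minority group with conflicting preferences
])

# Regret of a group for a policy = gap to that group's best available policy.
best_per_group = utilities.max(axis=1, keepdims=True)
regret = best_per_group - utilities

utilitarian_choice = int(utilities.mean(axis=0).argmax())
minmax_regret_choice = int(regret.max(axis=0).argmin())

print("utilitarian picks policy", utilitarian_choice,
      "-> worst-case regret", regret[:, utilitarian_choice].max())    # policy 0, regret 1.00
print("minmax regret picks policy", minmax_regret_choice,
      "-> worst-case regret", regret[:, minmax_regret_choice].max())  # policy 1, regret 0.45
```

Note that enlarging the candidate set, for example by allowing randomized mixtures of the per-type models, can only lower the achievable worst-case regret, which is one reason to aggregate over an ensemble rather than commit to a single model.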
diff --git a/domains/ai-alignment/pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state.md b/domains/ai-alignment/pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state.md
index 712b679a3..4275d94d6 100644
--- a/domains/ai-alignment/pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state.md
+++ b/domains/ai-alignment/pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state.md
@@ -26,8 +26,8 @@ Harland et al (NeurIPS 2024) propose the technical mechanism: Multi-Objective RL
 **EM-DPO enrichment (extend)**: The EM-DPO paper provides a concrete implementation of distributional pluralism through latent preference type discovery. Rather than treating preference diversity as noise to average out, EM-DPO uses Expectation-Maximization to identify K distinct preference clusters from binary comparison data, then trains separate models for each type. This operationalizes the principle that diverse values should be accommodated structurally (through model ensembles) rather than collapsed into consensus. The MinMax Regret aggregation strategy then ensures no preference group experiences catastrophic dissatisfaction at deployment time.
 
 **Relevant Notes:**
-- [[binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity]] — describes the technical failure mode
+- [[standard-pairwise-rlhf-collapses-latent-preference-types-because-single-reward-function-training-cannot-recover-diversity-that-binary-comparisons-encode]] — describes the technical failure mode
 - [[egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment]] — MinMax Regret is a technical instantiation of this principle
-- [[democratic-alignment-assemblies-produce-constitutions-as-effective-as-expert-designed-ones-while-better-representing-diverse-populations]] — assemblies are one mechanism for pluralistic alignment
+- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — assemblies are one mechanism for pluralistic alignment
 
 **Topics:** AI alignment, preference diversity, value pluralism, multi-objective optimization
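The Expectation-Maximization loop described in the enrichment paragraph above can be sketched compactly. The example below is a toy stand-in, not the paper's implementation: it replaces the LLM policy with a logistic Bradley-Terry model over synthetic feature differences, and the number of types, features, and true weights are all invented. It does follow the same structure the note describes: an E-step that assigns each annotator soft responsibilities over latent preference types, and an M-step that refits each type's parameters with responsibility-weighted data.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, M, D = 2, 60, 25, 2           # types, annotators, comparisons per annotator, feature dim

true_w = np.array([[3.0, -1.0],     # type 0 rewards feature 0 (e.g. "safety")
                   [-1.0, 3.0]])    # type 1 rewards feature 1 (e.g. "capability")
z = rng.integers(0, K, size=N)                       # hidden type of each annotator
X = rng.normal(size=(N, M, D))                       # feature differences per comparison
logits = np.einsum("nmd,nd->nm", X, true_w[z])
Y = (rng.random((N, M)) < 1 / (1 + np.exp(-logits))).astype(float)   # binary preference labels

def loglik(w):
    """Per-annotator log-likelihood of that annotator's labels under weight vector w."""
    p = 1 / (1 + np.exp(-np.einsum("nmd,d->nm", X, w)))
    return (Y * np.log(p + 1e-9) + (1 - Y) * np.log(1 - p + 1e-9)).sum(axis=1)

w = rng.normal(size=(K, D))          # per-type parameters
pi = np.full(K, 1 / K)               # mixing weights over types

for _ in range(30):
    # E-step: responsibility of each type for each annotator's whole label history.
    ll = np.stack([loglik(w[k]) for k in range(K)], axis=1) + np.log(pi)
    ll -= ll.max(axis=1, keepdims=True)
    gamma = np.exp(ll) / np.exp(ll).sum(axis=1, keepdims=True)

    # M-step: update mixing weights, then refit each type by responsibility-weighted
    # gradient ascent on the logistic (Bradley-Terry) objective.
    pi = gamma.mean(axis=0)
    for k in range(K):
        for _ in range(50):
            p = 1 / (1 + np.exp(-np.einsum("nmd,d->nm", X, w[k])))
            grad = np.einsum("nm,nmd->d", gamma[:, k][:, None] * (Y - p), X)
            w[k] += 0.05 * grad / N

print("mixing weights:", np.round(pi, 2))
print("recovered per-type weights:\n", np.round(w, 2))
print("type assignment accuracy (up to label swap):",
      max((gamma.argmax(1) == z).mean(), (gamma.argmax(1) != z).mean()))
```

In EM-DPO itself the per-type model would be a DPO-trained policy rather than a logistic weight vector, with the MinMax Regret step applied afterward to the resulting ensemble; the sketch only illustrates how the latent types become recoverable once the mixture structure is modeled instead of averaged away.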