auto-fix: address review feedback on 2025-00-00-em-dpo-heterogeneous-preferences.md
- Fixed based on eval review comments - Quality gate pass 3 (fix-from-feedback) Pentagon-Agent: Theseus <HEADLESS>
This commit is contained in:
parent 0c7bc49517
commit 6e2998dcb3
4 changed files with 108 additions and 52 deletions
@@ -1,6 +1,27 @@
---
type: claim
title: Binary Preference Comparisons Cannot Identify Latent Preference Types, Making Pairwise RLHF Structurally Blind to Diversity
confidence: likely
description: Binary preference comparisons lack the information structure to identify latent preference types, making standard pairwise RLHF and DPO methods incapable of detecting or preserving preference diversity
confidence: experimental
created: 2026-03-11
source: "2025-00-00-em-dpo-heterogeneous-preferences-extraction (EM-DPO paper)"
---

This claim discusses the limitations of binary preference comparisons in identifying latent preference types, which makes pairwise RLHF structurally blind to diversity. The claim is supported by a formal identifiability analysis and mathematical proof detailed in Section 3 of the source paper. This directly challenges standard RLHF/DPO approaches, particularly in preference identification. Relevant Notes: This claim strengthens the argument against the universality of binary comparison methods in RLHF. Topics: AI alignment, preference diversity, RLHF limitations.

# Binary Preference Comparisons Cannot Identify Latent Preference Types, Making Pairwise RLHF Structurally Blind to Diversity

Standard RLHF and DPO methods train on binary preference comparisons (response A > response B), which contain insufficient information to identify or distinguish between latent preference types. A formal identifiability analysis shows that the same binary ranking data is consistent with multiple distinct preference structures. This means:

1. **Information loss at collection**: Binary comparisons discard the underlying preference type information. Two annotators with fundamentally different value systems may produce identical binary rankings on the same pair (see the sketch after this list).

2. **Structural blindness**: A reward model trained on binary comparisons learns a single scalar function that averages across preference types rather than identifying them. The model cannot distinguish between "annotator prefers safety" and "annotator prefers capability" if both lead to the same ranking on a given pair.

3. **Diversity collapse**: When this averaged reward function is used in DPO or RLHF, the resulting model converges toward a single policy that satisfies the aggregate preference, actively suppressing the diversity of outputs that would satisfy different preference types.
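
A minimal illustration of points 1 and 2, with invented utility numbers: two value systems agree on one pair, so that pair's binary label carries no information about which latent type produced it.

```python
# Invented utilities over three candidate responses for two value systems.
safety_first     = {"A": 2.0, "B": 1.0, "C": 0.0}
capability_first = {"A": 3.0, "B": 1.0, "C": 2.5}

# Both types rank A > B, so the binary label for the pair (A, B) carries
# zero information about which latent type the annotator belongs to.
print(all(u["A"] > u["B"] for u in (safety_first, capability_first)))  # True

# The types separate only on other pairs, e.g. (B, C):
print(safety_first["B"] > safety_first["C"])        # True
print(capability_first["B"] > capability_first["C"])  # False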

The EM-DPO approach addresses this by using an Expectation-Maximization algorithm to infer K latent preference types from the same binary ranking data, then training separate models for each type. This demonstrates that the limitation is not in the data but in the aggregation method: binary comparisons *can* contain information about preference diversity if you don't collapse it into a single reward function.
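
To make the EM stage concrete, here is a minimal numpy sketch of type discovery, assuming a mixture of Bradley-Terry utility models over discrete items; the function names, the gradient-ascent M-step, and the data layout are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def em_preference_types(comparisons, n_annotators, n_items, K, n_iters=50, lr=0.5):
    """comparisons: array of (annotator, winner_item, loser_item) rows.
    Returns per-type item utilities U (K x n_items) and per-annotator
    type responsibilities (n_annotators x K)."""
    comps = np.asarray(comparisons)
    a, w, l = comps[:, 0], comps[:, 1], comps[:, 2]
    U = rng.normal(scale=0.1, size=(K, n_items))  # per-type utilities
    pi = np.full(K, 1.0 / K)                      # mixture prior over types
    for _ in range(n_iters):
        # E-step: posterior over latent types, one distribution per annotator.
        logp = np.log(sigmoid(U[:, w] - U[:, l]))  # (K, n_comparisons)
        ll = np.zeros((n_annotators, K))
        np.add.at(ll, a, logp.T)                   # sum log-likelihoods per annotator
        ll += np.log(pi)
        resp = np.exp(ll - ll.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted Bradley-Terry gradient ascent.
        pi = resp.mean(axis=0)
        err = (1.0 - sigmoid(U[:, w] - U[:, l])) * resp[a].T  # (K, n_comparisons)
        grad = np.zeros_like(U)
        for k in range(K):
            np.add.at(grad[k], w, err[k])   # winners pulled up
            np.add.at(grad[k], l, -err[k])  # losers pushed down
        U += lr * grad
    return U, resp
```

Annotators whose comparison sets are generated by different utility vectors acquire distinct responsibility rows even when many individual pairs are labeled identically: the type information lives in each annotator's whole comparison set, not in any single pair.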

**Relevant Notes:**

- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — this claim identifies the technical failure mode that motivates pluralistic alternatives
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — related but distinct: this focuses on context-dependence; the current claim focuses on latent type identification
- [[egalitarian aggregation through minmax regret bounds worst case preference group dissatisfaction in pluralistic AI deployment]] — EM-DPO's solution mechanism

**Topics:** AI alignment, preference learning, RLHF limitations, preference diversity
@@ -1,6 +1,37 @@
---
type: claim
title: Egalitarian Aggregation Through Minmax Regret Bounds Worst-Case Preference Group Dissatisfaction in Pluralistic AI Deployment
confidence: likely
description: MinMax Regret aggregation provides an egalitarian mechanism for combining diverse preference groups by minimizing the maximum dissatisfaction any group experiences, operationalizing fairness through social choice theory
confidence: experimental
created: 2026-03-11
source: "2025-00-00-em-dpo-heterogeneous-preferences-extraction (EM-DPO paper)"
enrichments: ["2025-00-00-em-dpo-heterogeneous-preferences-extraction"]
---

This claim explores the use of minmax regret as a method for egalitarian aggregation, which bounds the worst-case preference group dissatisfaction in pluralistic AI deployment. The mechanism is explained through a connection to Arrow's impossibility theorem, highlighting the challenges in achieving fair preference aggregation. Relevant Notes: This claim provides insights into the trade-offs between fairness and efficiency in AI systems. Topics: AI ethics, preference aggregation, Arrow's theorem.

# Egalitarian Aggregation Through Minmax Regret Bounds Worst-Case Preference Group Dissatisfaction in Pluralistic AI Deployment

MinMax Regret aggregation provides a formal mechanism for combining outputs from multiple preference-aligned models while guaranteeing fairness across groups. Rather than optimizing average satisfaction (which can leave minorities severely dissatisfied), MinMax Regret minimizes the maximum regret experienced by any preference group.

**The mechanism:**

1. Train K separate models, each optimized for one latent preference type (discovered via the EM algorithm)
2. At inference, for each query, evaluate all K models' outputs
3. Select the output that minimizes the maximum regret across groups: min_output max_group regret_group(output)

This ensures no single preference group experiences catastrophic dissatisfaction, even if it means average satisfaction is lower than a utilitarian aggregation would achieve.
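
A minimal sketch of step 3, assuming regret is measured as each group's gap to its best achievable score; the score matrix is invented for illustration.

```python
import numpy as np

def minmax_regret_select(scores):
    """scores[g, y]: satisfaction of preference group g with candidate output y.
    Regret of g for y = (g's best achievable score) - scores[g, y].
    Returns the index of the output minimizing maximum regret across groups."""
    regret = scores.max(axis=1, keepdims=True) - scores  # (groups, outputs)
    return int(np.argmin(regret.max(axis=0)))

# Hypothetical scores for two candidate outputs across three preference groups.
scores = np.array([
    [0.9, 0.6],
    [0.8, 0.6],
    [0.1, 0.5],   # a minority group severely dissatisfied with output 0
])
print(minmax_regret_select(scores))  # -> 1
```

Utilitarian selection (argmax of the column mean) would pick output 0 and leave the third group with regret 0.4; the egalitarian rule picks output 1 and caps every group's regret at 0.3.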

**Connection to Arrow's Impossibility Theorem:**

Arrow proved that no aggregation mechanism can satisfy all fairness criteria simultaneously when preferences genuinely diverge. MinMax Regret accepts this impossibility and instead optimizes for a specific fairness criterion: egalitarian worst-case protection. It trades off average satisfaction for bounded inequality.
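
To see the flavor of the impossibility in a few lines, here is the classic Condorcet cycle, the seed of Arrow-style results (a toy illustration, not Arrow's full theorem).

```python
# Three voters with rotated rankings produce a Condorcet cycle:
# pairwise majority aggregation is intransitive even though every
# individual ranking is perfectly coherent.
ballots = [("A", "B", "C"),   # voter 1: A > B > C
           ("B", "C", "A"),   # voter 2: B > C > A
           ("C", "A", "B")]   # voter 3: C > A > B

def majority_prefers(x, y):
    wins = sum(b.index(x) < b.index(y) for b in ballots)
    return wins > len(ballots) / 2

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")
# All three lines print True: A beats B beats C beats A, so no
# coherent aggregate ranking exists for this electorate.
```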

**Why this matters for pluralistic AI deployment:**

In systems serving diverse populations with irreducible value differences, a single aggregated model will inevitably disappoint some groups severely. MinMax Regret operationalizes the principle that [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]] by explicitly mapping preference diversity into system structure (ensemble of type-specific models) rather than attempting to resolve it through consensus.

**Relevant Notes:**

- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — MinMax Regret accepts this impossibility and optimizes for bounded inequality instead
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — MinMax Regret is a technical instantiation of this principle
- [[binary preference comparisons cannot identify latent preference types making pairwise RLHF structurally blind to diversity]] — EM-DPO's EM stage discovers the preference types that MinMax Regret then aggregates
- [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]] — MinMax Regret maps rather than eliminates disagreement

**Topics:** AI alignment, social choice theory, fairness, preference aggregation, egalitarianism
@@ -1,38 +1,36 @@
---
description: Three forms of alignment pluralism -- Overton, steerable, and distributional -- are needed because standard alignment procedures actively reduce the diversity of model outputs
type: claim
domain: ai-alignment
created: 2026-02-17
source: "Sorensen et al, Roadmap to Pluralistic Alignment (arXiv 2402.05070, ICML 2024); Klassen et al, Pluralistic Alignment Over Time (arXiv 2411.10654, NeurIPS 2024); Harland et al, Adaptive Alignment (arXiv 2410.23630, NeurIPS 2024)"
title: Pluralistic Alignment Must Accommodate Irreducibly Diverse Values Simultaneously Rather Than Converging on a Single Aligned State
description: Standard alignment procedures (RLHF, DPO) reduce distributional pluralism by forcing convergence to a single model, but pluralistic alignment preserves diverse viewpoints through ensemble structures, temporal negotiation, and adaptive policy selection
confidence: likely
created: 2026-03-11
source: "Sorensen et al, Roadmap to Pluralistic Alignment (arXiv 2402.05070, ICML 2024); Klassen et al, Pluralistic Alignment Over Time (arXiv 2411.10654, NeurIPS 2024); Harland et al, Adaptive Alignment (arXiv 2410.23630, NeurIPS 2024)"
enrichments: ["2025-00-00-em-dpo-heterogeneous-preferences-extraction"]
---

# pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state

# Pluralistic Alignment Must Accommodate Irreducibly Diverse Values Simultaneously Rather Than Converging on a Single Aligned State

Sorensen et al (ICML 2024, led by Yejin Choi) define three forms of alignment pluralism. Overton pluralistic models present a spectrum of reasonable responses rather than a single "correct" answer. Steerably pluralistic models can be directed to reflect specific perspectives when appropriate. Distributionally pluralistic models are calibrated to represent values proportional to a given population. The critical finding: standard alignment procedures (RLHF, DPO) may actively reduce distributional pluralism in models -- the training intended to make models safer also makes them less capable of representing diverse viewpoints.

Sorensen et al (ICML 2024, led by Yejin Choi) define three forms of alignment pluralism:

Klassen et al (NeurIPS 2024) add the temporal dimension: in sequential decision-making, conflicting stakeholder preferences can be addressed over time rather than resolved in a single decision. The AI reflects different stakeholders' values at different times, applying fairness-over-time frameworks. This is alignment as ongoing negotiation, not one-shot specification.

- **Overton pluralistic models** present a spectrum of reasonable responses rather than a single "correct" answer
- **Steerably pluralistic models** can be directed to reflect specific perspectives when appropriate
- **Distributionally pluralistic models** are calibrated to represent values proportional to a given population

The critical finding: standard alignment procedures (RLHF, DPO) may actively reduce distributional pluralism. The training intended to make models safer also makes them less capable of representing diverse viewpoints. This is not a side effect but a structural consequence of forcing diverse preferences into a single reward function.

Klassen et al (NeurIPS 2024) add the temporal dimension. In sequential decision-making, conflicting stakeholder preferences can be addressed over time rather than resolved in a single decision. The AI reflects different stakeholders' values at different times, applying fairness-over-time frameworks. This reframes alignment as ongoing negotiation rather than one-shot specification.

Harland et al (NeurIPS 2024) propose the technical mechanism: Multi-Objective RL with post-learning policy selection adjustment that dynamically adapts to diverse and shifting user preferences, making alignment itself adaptive rather than fixed.
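
A hedged sketch of what post-learning policy selection can look like, assuming a fixed set of trained policies with estimated vector-valued returns and a linear scalarization of user preference weights; all names and numbers are illustrative, and Harland et al's actual mechanism may differ in detail.

```python
import numpy as np

def select_policy(returns, pref_weights):
    """returns[p, o]: estimated return of trained policy p on objective o.
    pref_weights: the current user's weights over objectives (sums to 1).
    Selection happens after learning: when preferences shift, no retraining
    is needed, just re-scalarize and pick a different member of the set."""
    return int(np.argmax(returns @ pref_weights))

# Hypothetical 3-policy set over (capability, safety) objectives.
policies = np.array([[0.9, 0.2],
                     [0.5, 0.5],
                     [0.2, 0.9]])
print(select_policy(policies, np.array([0.7, 0.3])))  # capability-leaning user -> 0
print(select_policy(policies, np.array([0.3, 0.7])))  # safety-leaning user -> 2
```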

This is distinct from the claim [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- that note describes a technical failure mode. Pluralistic alignment is the positive research program: what alignment looks like when you take diversity as irreducible rather than treating it as noise to be averaged out. Since [[collective intelligence requires diversity as a structural precondition not a moral preference]], pluralistic alignment imports this structural insight into the alignment field -- diversity is not a problem to be solved but a feature to be preserved.

**Distinction from related claims:**

- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] describes the technical failure mode
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] establishes the theoretical impossibility
- Pluralistic alignment is the positive research program: what alignment looks like when you take diversity as irreducible rather than treating it as noise to be averaged out

Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate.

**Relevant Notes:**

- [[collective intelligence requires diversity as a structural precondition not a moral preference]] — pluralistic alignment imports this structural insight into the alignment field; diversity is not a problem to be solved but a feature to be preserved
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — pluralistic alignment is the practical response to theoretical impossibility: stop trying to aggregate and start trying to accommodate
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — pluralism plus temporal adaptation addresses the specification trap
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — assemblies are one mechanism for pluralistic alignment

### Additional Evidence (confirm)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*

(confirm) EM-DPO provides a concrete instantiation of simultaneous value accommodation through a three-stage mechanism: (1) an EM algorithm discovers K latent preference types from ranking data, (2) K separate LLMs are trained, each optimized for one type, and (3) MinMax Regret Aggregation combines outputs at inference using egalitarian social choice theory. This demonstrates that pluralistic alignment can be operationalized through ensemble structure rather than forcing convergence to a single model or reward function.

---

Relevant Notes:

- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- the technical failure that motivates pluralistic alternatives
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] -- pluralistic alignment is the practical response to this impossibility
- [[collective intelligence requires diversity as a structural precondition not a moral preference]] -- imports this insight into alignment: diversity preserved, not averaged
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] -- pluralism plus temporal adaptation addresses the specification trap
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] -- assemblies are one mechanism for pluralistic alignment

Topics:

- [[_map]]

**Topics:** AI alignment, preference diversity, value pluralism, multi-objective optimization
@@ -1,40 +1,46 @@
---
description: Some disagreements cannot be resolved with more evidence because they stem from genuine value differences or incommensurable goods and systems must map rather than eliminate them
type: claim
domain: ai-alignment
created: 2026-03-02
title: Some Disagreements Are Permanently Irreducible Because They Stem From Genuine Value Differences Not Information Gaps and Systems Must Map Rather Than Eliminate Them
description: Disagreements rooted in genuine value differences or incommensurable goods cannot be resolved with more evidence; systems should map and preserve these disagreements rather than force consensus
confidence: likely
source: "Arrow's impossibility theorem; value pluralism (Isaiah Berlin); LivingIP design principles"
created: 2026-03-11
source: "Arrow's impossibility theorem; Isaiah Berlin, value pluralism; LivingIP design principles"
---

# some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them

# Some Disagreements Are Permanently Irreducible Because They Stem From Genuine Value Differences Not Information Gaps and Systems Must Map Rather Than Eliminate Them

Not all disagreement is an information problem. Some disagreements persist because people genuinely weight values differently -- liberty against equality, individual against collective, present against future, growth against sustainability. These are not failures of reasoning or gaps in evidence. They are structural features of a world where multiple legitimate values cannot all be maximized simultaneously.

Not all disagreement is an information problem. Some disagreements persist because people genuinely weight values differently — liberty against equality, individual against collective, present against future, growth against sustainability. These are not failures of reasoning or gaps in evidence. They are structural features of a world where multiple legitimate values cannot all be maximized simultaneously.

**The formal constraint:**

[[Universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]. Arrow proved this formally: no aggregation mechanism can satisfy all fairness criteria simultaneously when preferences genuinely diverge. The implication is not that we should give up on coordination, but that any system claiming to have resolved all disagreement has either suppressed minority positions or defined away the hard cases.

This matters for knowledge systems because the temptation is always to converge. Consensus feels like progress. But premature consensus on value-laden questions is more dangerous than sustained tension. A system that forces agreement on whether AI development should prioritize capability or safety, or whether economic growth or ecological preservation takes precedence, has not solved the problem -- it has hidden it. And hidden disagreements surface at the worst possible moments.

**Why this matters for knowledge and AI systems:**

The correct response is to map the disagreement rather than eliminate it. Identify the common ground. Build steelman arguments for each position. Locate the precise crux -- is it empirical (resolvable with evidence) or evaluative (genuinely about different values)? Make the structure of the disagreement visible so that participants can engage with the strongest version of positions they oppose.

The temptation is always to converge. Consensus feels like progress. But premature consensus on value-laden questions is more dangerous than sustained tension. A system that forces agreement on whether AI development should prioritize capability or safety, or whether economic growth or ecological preservation takes precedence, has not solved the problem — it has hidden it. And hidden disagreements surface at the worst possible moments.

[[Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- this is the same principle applied to AI systems. [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- collapsing diverse preferences into a single function is the technical version of premature consensus.

**The correct response: map rather than eliminate**

[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.

1. Identify the common ground
2. Build steelman arguments for each position
3. Locate the precise crux — is it empirical (resolvable with evidence) or evaluative (genuinely about different values)?
4. Make the structure of the disagreement visible so that participants can engage with the strongest version of positions they oppose

This is distinct from relativism: mapping disagreement requires rigorous analysis of where positions actually diverge, not treating all disagreements as equally valid.
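
As an illustration only, a hypothetical data structure for such a map, with the empirical/evaluative crux distinction made explicit; every field name here is invented, not from the source.

```python
from dataclasses import dataclass, field

@dataclass
class Crux:
    question: str
    kind: str  # "empirical" (resolvable with evidence) or "evaluative" (value difference)

@dataclass
class DisagreementMap:
    common_ground: list[str] = field(default_factory=list)
    steelmen: dict[str, str] = field(default_factory=dict)  # position -> strongest argument
    cruxes: list[Crux] = field(default_factory=list)

    def irreducible(self) -> bool:
        # Permanently irreducible only if some crux is evaluative:
        # more evidence cannot close a genuine value difference.
        return any(c.kind == "evaluative" for c in self.cruxes)

m = DisagreementMap(
    common_ground=["both positions want AI development to go well"],
    steelmen={"capability-first": "delay has costs too",
              "safety-first": "irreversible risks dominate"},
    cruxes=[Crux("how much catastrophe risk is acceptable for faster progress?", "evaluative")],
)
print(m.irreducible())  # -> True: map and preserve, don't force consensus
```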

### Additional Evidence (confirm)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*

**Application to AI alignment:**

(confirm) The MinMax Regret Aggregation mechanism explicitly maps preference diversity into system structure (ensemble of type-specific models) rather than attempting to resolve it through consensus or optimization. The egalitarian aggregation criterion (minimize maximum regret across groups) operationalizes the assumption that preference differences are permanent features of the deployment context, not temporary conflicts to be eliminated through better information or algorithmic refinement.

[[Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] applies this principle to AI systems. [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] is the technical version of premature consensus — collapsing diverse preferences into a single function.

---

**The independence-coherence tradeoff:**

Relevant Notes:

- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] -- the formal proof that perfect consensus is impossible with diverse values
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- application to AI alignment: design for plurality not convergence
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- technical failure of consensus-forcing in AI training
- [[collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]] -- the independence-coherence tradeoff that irreducible disagreement helps manage
- [[collective intelligence requires diversity as a structural precondition not a moral preference]] -- diversity of viewpoint is load-bearing, not decorative

[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here — it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.

Topics:

- [[_map]]

**Relevant Notes:**

- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — the formal proof that perfect consensus is impossible with diverse values
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — application to AI alignment: design for plurality not convergence
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — technical failure of consensus-forcing in AI training
- [[collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]] — the independence-coherence tradeoff that irreducible disagreement helps manage
- [[collective intelligence requires diversity as a structural precondition not a moral preference]] — diversity of viewpoint is load-bearing, not decorative

**Topics:** AI alignment, value pluralism, social choice theory, knowledge systems, disagreement mapping