auto-fix: address review feedback on 2025-00-00-em-dpo-heterogeneous-preferences.md
- Fixed based on eval review comments
- Quality gate pass 3 (fix-from-feedback)

Pentagon-Agent: Theseus <HEADLESS>
parent 6e2998dcb3
commit 91e47d24ee

4 changed files with 11 additions and 61 deletions
@@ -4,7 +4,8 @@ title: Binary Preference Comparisons Cannot Identify Latent Preference Types, Ma

description: Binary preference comparisons lack the information structure to identify latent preference types, making standard pairwise RLHF and DPO methods incapable of detecting or preserving preference diversity
confidence: experimental
created: 2026-03-11
source: "2025-00-00-em-dpo-heterogeneous-preferences-extraction (EM-DPO paper)"
processed_date: 2026-03-11
source: "EM-DPO Heterogeneous Preferences Extraction (2025-00-00-em-dpo-heterogeneous-preferences-extraction)"
---

# Binary Preference Comparisons Cannot Identify Latent Preference Types, Making Pairwise RLHF Structurally Blind to Diversity

@@ -21,7 +22,6 @@ The EM-DPO approach addresses this by using an Expectation-Maximization algorith

**Relevant Notes:**
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — this claim identifies the technical failure mode that motivates pluralistic alternatives
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — related but distinct: that claim focuses on context-dependence; the current claim focuses on latent type identification
- [[egalitarian aggregation through minmax regret bounds worst case preference group dissatisfaction in pluralistic AI deployment]] — EM-DPO's solution mechanism
- [[egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment]] — EM-DPO's solution mechanism

**Topics:** AI alignment, preference learning, RLHF limitations, preference diversity
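
To make the identifiability failure concrete, here is a minimal sketch (illustrative only, not from the EM-DPO paper; the reward values, the Bradley-Terry choice model, and the 50/50 population split are all assumptions): two opposed preference types, pooled, produce exactly the marginal win rate of a single indifferent annotator, so pairwise data without annotator identity cannot distinguish the two worlds.

```python
import numpy as np

def bt_win_prob(r_a, r_b):
    """Bradley-Terry probability that response A is preferred over B."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

# Two hypothetical latent types with opposed rewards for responses A and B.
p_type1 = bt_win_prob(2.0, 0.0)   # type 1 strongly prefers A (~0.88)
p_type2 = bt_win_prob(0.0, 2.0)   # type 2 strongly prefers B (~0.12)

# Marginal win rate from a pooled 50/50 population of the two types...
p_mix = 0.5 * p_type1 + 0.5 * p_type2   # = 0.50

# ...equals the win rate of a single annotator indifferent between A and B.
p_single = bt_win_prob(1.0, 1.0)        # = 0.50

print(f"pooled mixture: {p_mix:.2f}, indifferent annotator: {p_single:.2f}")
# Both data-generating processes yield identical pairwise statistics, so a
# method that sees only pooled (winner, loser) pairs cannot recover the types.
```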

@@ -4,7 +4,8 @@ title: Egalitarian Aggregation Through Minmax Regret Bounds Worst-Case Preferenc

description: MinMax Regret aggregation provides an egalitarian mechanism for combining diverse preference groups by minimizing the maximum dissatisfaction any group experiences, operationalizing fairness through social choice theory
confidence: experimental
created: 2026-03-11
source: "2025-00-00-em-dpo-heterogeneous-preferences-extraction (EM-DPO paper)"
processed_date: 2026-03-11
source: "EM-DPO Heterogeneous Preferences Extraction (2025-00-00-em-dpo-heterogeneous-preferences-extraction)"
enrichments: ["2025-00-00-em-dpo-heterogeneous-preferences-extraction"]
---

@@ -26,12 +27,10 @@ Arrow proved that no aggregation mechanism can satisfy all fairness criteria sim

**Why this matters for pluralistic AI deployment:**

In systems serving diverse populations with irreducible value differences, a single aggregated model will inevitably disappoint some groups severely. MinMax Regret operationalizes the principle that [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]] by explicitly mapping preference diversity into system structure (ensemble of type-specific models) rather than attempting to resolve it through consensus.
In systems serving diverse populations with irreducible value differences, a single aggregated model will inevitably disappoint some groups severely. MinMax Regret operationalizes the principle that disagreements rooted in genuine value differences cannot be resolved with more evidence by explicitly mapping preference diversity into system structure (ensemble of type-specific models) rather than attempting to resolve it through consensus.
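
A minimal numeric sketch of the selection rule (the utility table and the three candidate policies are invented for illustration, not taken from the paper): compute each group's regret against its own ideal policy, then choose the candidate whose worst-case regret is smallest.

```python
import numpy as np

# Rows: preference groups; columns: candidate policies. Values are each
# group's expected utility under each policy (hypothetical numbers).
utility = np.array([
    [0.9, 0.6, 0.7],   # group 0
    [0.2, 0.8, 0.7],   # group 1
    [0.5, 0.5, 0.7],   # group 2
])

ideal = utility.max(axis=1, keepdims=True)  # each group's best achievable utility
regret = ideal - utility                    # dissatisfaction relative to ideal
worst_case = regret.max(axis=0)             # worst-off group, per policy

chosen = int(np.argmin(worst_case))         # egalitarian MinMax Regret choice
print(f"chosen policy: {chosen}, worst-case regret: {worst_case[chosen]:.2f}")
# Policy 2 wins: no group's regret exceeds 0.2, while policies 0 and 1 each
# leave some group with regret of 0.3 or more.
```

The max over groups is what makes the rule egalitarian; replacing it with a mean would recover ordinary average-utility aggregation.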

**Relevant Notes:**
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — MinMax Regret accepts this impossibility and optimizes for bounded inequality instead
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — MinMax Regret is a technical instantiation of this principle
- [[binary preference comparisons cannot identify latent preference types making pairwise RLHF structurally blind to diversity]] — EM-DPO's EM stage discovers the preference types that MinMax Regret then aggregates
- [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]] — MinMax Regret maps rather than eliminates disagreement
- [[binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity]] — EM-DPO's EM stage discovers the preference types that MinMax Regret then aggregates

**Topics:** AI alignment, social choice theory, fairness, preference aggregation, egalitarianism

@@ -4,6 +4,7 @@ title: Pluralistic Alignment Must Accommodate Irreducibly Diverse Values Simulta

description: Standard alignment procedures (RLHF, DPO) reduce distributional pluralism by forcing convergence to a single model, but pluralistic alignment preserves diverse viewpoints through ensemble structures, temporal negotiation, and adaptive policy selection
confidence: likely
created: 2026-03-11
processed_date: 2026-03-11
source: "Sorensen et al, Roadmap to Pluralistic Alignment (arXiv 2402.05070, ICML 2024); Klassen et al, Pluralistic Alignment Over Time (arXiv 2411.10654, NeurIPS 2024); Harland et al, Adaptive Alignment (arXiv 2410.23630, NeurIPS 2024)"
enrichments: ["2025-00-00-em-dpo-heterogeneous-preferences-extraction"]
---

@@ -22,15 +23,11 @@ Klassen et al (NeurIPS 2024) add the temporal dimension. In sequential decision-

Harland et al (NeurIPS 2024) propose the technical mechanism: Multi-Objective RL with post-learning policy selection adjustment that dynamically adapts to diverse and shifting user preferences, making alignment itself adaptive rather than fixed.
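
A hedged sketch of what post-learning policy selection can look like (the two objectives, the Pareto set, and the weight vectors are hypothetical, not Harland et al's actual setup): training fixes a small set of Pareto-optimal policies, and deployment-time selection re-scalarizes over them as user preference weights shift, with no retraining.

```python
import numpy as np

# Expected returns of three already-trained Pareto-optimal policies on two
# objectives, e.g. (helpfulness, caution). Numbers are invented.
pareto_policies = np.array([
    [0.9, 0.3],
    [0.7, 0.6],
    [0.4, 0.9],
])

def select_policy(weights: np.ndarray) -> int:
    """Pick the policy maximizing the current scalarized objective."""
    return int(np.argmax(pareto_policies @ weights))

# Selection adapts as preferences shift, without touching the trained policies.
print(select_policy(np.array([0.8, 0.2])))   # capability-leaning user -> policy 0
print(select_policy(np.array([0.2, 0.8])))   # caution-leaning user    -> policy 2
```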

**Distinction from related claims:**
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] describes the technical failure mode
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] establishes the theoretical impossibility
- Pluralistic alignment is the positive research program: what alignment looks like when you take diversity as irreducible rather than treating it as noise to be averaged out

**EM-DPO enrichment (extend)**: The EM-DPO paper provides a concrete implementation of distributional pluralism through latent preference type discovery. Rather than treating preference diversity as noise to average out, EM-DPO uses Expectation-Maximization to identify K distinct preference clusters from binary comparison data, then trains separate models for each type. This operationalizes the principle that diverse values should be accommodated structurally (through model ensembles) rather than collapsed into consensus.
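
A compact sketch of that EM loop under stated assumptions: tabular Bradley-Terry rewards over a few items and a gradient M-step stand in for the paper's training of type-specific models, and the toy data generator, step size, and iteration count are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_items, K, n_annotators = 4, 2, 20

# Ground truth used only to generate toy data: two latent types with opposed
# tabular rewards, each annotator drawn from one type.
true_r = np.array([[2.0, 0.0, 1.0, -1.0],
                   [-1.0, 1.0, 0.0, 2.0]])
true_z = rng.integers(K, size=n_annotators)
comparisons = []                        # comparisons[i]: (winner, loser) pairs
for i in range(n_annotators):
    pairs = []
    for _ in range(30):
        a, b = rng.choice(n_items, size=2, replace=False)
        p = sigmoid(true_r[true_z[i], a] - true_r[true_z[i], b])
        pairs.append((a, b) if rng.random() < p else (b, a))
    comparisons.append(pairs)
n_pairs = sum(len(p) for p in comparisons)

r = rng.normal(size=(K, n_items))       # type-specific reward estimates
pi = np.full(K, 1.0 / K)                # mixture weights over types

for _ in range(100):
    # E-step: posterior responsibility of each type for each annotator,
    # using all of that annotator's comparisons jointly.
    log_resp = np.tile(np.log(pi), (n_annotators, 1))
    for i, pairs in enumerate(comparisons):
        for w, l in pairs:
            log_resp[i] += np.log(sigmoid(r[:, w] - r[:, l]))
    resp = np.exp(log_resp - log_resp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: update mixture weights; one responsibility-weighted gradient
    # step on the Bradley-Terry log-likelihood for each type's rewards.
    pi = resp.mean(axis=0)
    grad = np.zeros_like(r)
    for i, pairs in enumerate(comparisons):
        for w, l in pairs:
            err = 1.0 - sigmoid(r[:, w] - r[:, l])    # shape (K,)
            grad[:, w] += resp[i] * err
            grad[:, l] -= resp[i] * err
    r += 0.5 * grad / n_pairs
    r -= r.mean(axis=1, keepdims=True)  # pin down the translation freedom

# Weights should roughly match the empirical type proportions in true_z.
print("recovered mixture weights:", np.round(pi, 2))
```

Grouping comparisons by annotator is what makes the types identifiable here; shuffling the pairs without identity would collapse the problem back to the unidentifiable mixture described in the companion note.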

**Relevant Notes:**
- [[collective intelligence requires diversity as a structural precondition not a moral preference]] — pluralistic alignment imports this structural insight into the alignment field; diversity is not a problem to be solved but a feature to be preserved
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — pluralistic alignment is the practical response to theoretical impossibility: stop trying to aggregate and start trying to accommodate
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — pluralism plus temporal adaptation addresses the specification trap
- [[binary-preference-comparisons-cannot-identify-latent-preference-types-making-pairwise-RLHF-structurally-blind-to-diversity]] — describes the technical failure mode
- [[egalitarian-aggregation-through-minmax-regret-bounds-worst-case-preference-group-dissatisfaction-in-pluralistic-AI-deployment]] — MinMax Regret is a technical instantiation of this principle
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — assemblies are one mechanism for pluralistic alignment

**Topics:** AI alignment, preference diversity, value pluralism, multi-objective optimization

@@ -1,46 +0,0 @@

---
type: claim
title: Some Disagreements Are Permanently Irreducible Because They Stem From Genuine Value Differences Not Information Gaps and Systems Must Map Rather Than Eliminate Them
description: Disagreements rooted in genuine value differences or incommensurable goods cannot be resolved with more evidence; systems should map and preserve these disagreements rather than force consensus
confidence: likely
created: 2026-03-11
source: "Arrow's impossibility theorem; Isaiah Berlin, value pluralism; LivingIP design principles"
---

# Some Disagreements Are Permanently Irreducible Because They Stem From Genuine Value Differences Not Information Gaps and Systems Must Map Rather Than Eliminate Them

Not all disagreement is an information problem. Some disagreements persist because people genuinely weight values differently — liberty against equality, individual against collective, present against future, growth against sustainability. These are not failures of reasoning or gaps in evidence. They are structural features of a world where multiple legitimate values cannot all be maximized simultaneously.

**The formal constraint:**

[[Universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]. Arrow proved this formally: no aggregation mechanism can satisfy all fairness criteria simultaneously when preferences genuinely diverge. The implication is not that we should give up on coordination, but that any system claiming to have resolved all disagreement has either suppressed minority positions or defined away the hard cases.

**Why this matters for knowledge and AI systems:**

The temptation is always to converge. Consensus feels like progress. But premature consensus on value-laden questions is more dangerous than sustained tension. A system that forces agreement on whether AI development should prioritize capability or safety, or whether economic growth or ecological preservation takes precedence, has not solved the problem — it has hidden it. And hidden disagreements surface at the worst possible moments.

**The correct response: map rather than eliminate**

1. Identify the common ground
2. Build steelman arguments for each position
3. Locate the precise crux — is it empirical (resolvable with evidence) or evaluative (genuinely about different values)?
4. Make the structure of the disagreement visible so that participants can engage with the strongest version of positions they oppose

This is distinct from relativism: mapping disagreement requires rigorous analysis of where positions actually diverge, not treating all disagreements as equally valid.

**Application to AI alignment:**

[[Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] applies this principle to AI systems. [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] is the technical version of premature consensus — collapsing diverse preferences into a single function.

**The independence-coherence tradeoff:**

[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here — it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.

**Relevant Notes:**
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — the formal proof that perfect consensus is impossible with diverse values
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — application to AI alignment: design for plurality not convergence
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — technical failure of consensus-forcing in AI training
- [[collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]] — the independence-coherence tradeoff that irreducible disagreement helps manage
- [[collective intelligence requires diversity as a structural precondition not a moral preference]] — diversity of viewpoint is load-bearing, not decorative

**Topics:** AI alignment, value pluralism, social choice theory, knowledge systems, disagreement mapping