---
description: Three forms of alignment pluralism -- Overton, steerable, and distributional -- are needed because standard alignment procedures can actively reduce the diversity of model outputs
type: claim
domain: ai-alignment
created: 2026-02-17
source: "Sorensen et al, Roadmap to Pluralistic Alignment (arXiv 2402.05070, ICML 2024); Klassen et al, Pluralistic Alignment Over Time (arXiv 2411.10654, NeurIPS 2024); Harland et al, Adaptive Alignment (arXiv 2410.23630, NeurIPS 2024)"
confidence: likely
related:
- minority-preference-alignment-improves-33-percent-without-majority-compromise-suggesting-single-reward-leaves-value-on-table
- the variance of a learned preference sensitivity distribution diagnoses dataset heterogeneity and collapses to fixed-parameter behavior when preferences are homogeneous
- collective-intelligence-architectures-are-underexplored-for-alignment-despite-addressing-core-problems
reweave_edges:
- minority-preference-alignment-improves-33-percent-without-majority-compromise-suggesting-single-reward-leaves-value-on-table|related|2026-03-28
- pluralistic-ai-alignment-through-multiple-systems-preserves-value-diversity-better-than-forced-consensus|supports|2026-03-28
- single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness|supports|2026-03-28
- the variance of a learned preference sensitivity distribution diagnoses dataset heterogeneity and collapses to fixed-parameter behavior when preferences are homogeneous|related|2026-03-28
supports:
- pluralistic-ai-alignment-through-multiple-systems-preserves-value-diversity-better-than-forced-consensus
- single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness
---
# pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
Sorensen et al. (ICML 2024, with Yejin Choi as senior author) define three forms of alignment pluralism. Overton pluralistic models present a spectrum of reasonable responses rather than a single "correct" answer. Steerably pluralistic models can be directed to reflect specific perspectives when appropriate. Distributionally pluralistic models are calibrated so that the values they express occur in proportion to their prevalence in a given population. The critical finding: standard alignment procedures (RLHF, DPO) may actively reduce distributional pluralism in models -- the training intended to make models safer can also make them less capable of representing diverse viewpoints.
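
One way to make the distributional form concrete is to compare a model's answer distribution with a reference population's distribution on the same question. The sketch below uses Jensen-Shannon divergence as the gap measure. The choice of metric, the function names, and all the numbers are illustrative assumptions, not taken from Sorensen et al.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits, skipping zero-probability terms of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = disjoint support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical shares of answers (agree, neutral, disagree) to one survey question.
population = [0.45, 0.20, 0.35]
base_model = [0.40, 0.25, 0.35]      # assumed pre-alignment answer distribution
aligned_model = [0.90, 0.08, 0.02]   # assumed post-alignment, collapsed toward one answer

gap_base = js_divergence(population, base_model)
gap_aligned = js_divergence(population, aligned_model)
print(f"base gap: {gap_base:.3f} bits, aligned gap: {gap_aligned:.3f} bits")
```

Under these made-up numbers the collapsed distribution sits much further from the population than the base distribution does, which is the pattern the claim describes. A real evaluation would need many questions and sampled model outputs.
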
Klassen et al (NeurIPS 2024) add the temporal dimension: in sequential decision-making, conflicting stakeholder preferences can be addressed over time rather than resolved in a single decision. The AI reflects different stakeholders' values at different times, applying fairness-over-time frameworks. This is alignment as ongoing negotiation, not one-shot specification.
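
A minimal sketch of the fairness-over-time idea, assuming each decision can serve only one stakeholder and that each stakeholder has a target share of decisions. The largest-deficit rule, the stakeholder names, and the shares are illustrative assumptions, not the formalism of Klassen et al.

```python
def pick_stakeholder(served, target_share):
    """Return the stakeholder furthest below their target share at the next step."""
    step = sum(served.values()) + 1
    # Deficit = decisions owed by this step minus decisions actually received.
    return max(target_share, key=lambda s: target_share[s] * step - served[s])

def schedule(target_share, horizon):
    """Serve one stakeholder per step so long-run shares track the targets."""
    served = {s: 0 for s in target_share}
    order = []
    for _ in range(horizon):
        chosen = pick_stakeholder(served, target_share)
        served[chosen] += 1
        order.append(chosen)
    return order, served

# Hypothetical stakeholders whose preferences conflict on every single decision.
targets = {"residents": 0.5, "commuters": 0.3, "businesses": 0.2}
order, served = schedule(targets, horizon=10)
print(order)
print(served)
```

No single decision satisfies everyone, but the running tally keeps each stakeholder close to their share, which is the sense in which the conflict is handled over time rather than resolved once.
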
Harland et al. (NeurIPS 2024) propose the technical mechanism: Multi-Objective RL in which policy selection is adjusted after learning, so the system dynamically adapts to diverse and shifting user preferences. This makes alignment itself adaptive rather than fixed.
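
A minimal sketch of post-learning policy selection, assuming a multi-objective learner has already produced a small set of policies with known expected returns per objective. The policy names, return vectors, objectives, and the linear utility are illustrative assumptions, not the method of Harland et al.

```python
# Expected return per objective (helpfulness, caution) for each learned policy.
# In a real MORL setup these vectors would come from training; here they are made up.
POLICY_RETURNS = {
    "direct":   (0.9, 0.2),
    "balanced": (0.7, 0.7),
    "careful":  (0.3, 0.9),
}

def select_policy(weights, policy_returns=POLICY_RETURNS):
    """Pick the policy with the highest weighted return under the current preference weights."""
    def utility(name):
        return sum(w * r for w, r in zip(weights, policy_returns[name]))
    return max(policy_returns, key=utility)

# The same learned policy set serves different users without retraining.
print(select_policy((0.8, 0.2)))  # user who mostly values helpfulness
print(select_policy((0.5, 0.5)))  # user who weighs both equally
print(select_policy((0.2, 0.8)))  # user who mostly values caution
```

The design point is that preferences enter only at selection time, so a change in a user's weights changes which policy is used without touching the learned policies.
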
This is distinct from the claim that [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- that note describes a technical failure mode. Pluralistic alignment is the positive research program: what alignment looks like when you take diversity as irreducible rather than treating it as noise to be averaged out. Since [[collective intelligence requires diversity as a structural precondition not a moral preference]], pluralistic alignment imports this structural insight into the alignment field -- diversity is not a problem to be solved but a feature to be preserved.
Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate.
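
The aggregation problem can be illustrated with the classic Condorcet cycle, a simple special case of the difficulty that Arrow's theorem generalizes. The three rankings and the value labels below are hypothetical. The sketch shows only that pairwise majority voting over individually coherent rankings can yield no coherent group ranking.

```python
from itertools import permutations

# Three hypothetical stakeholders, each with a coherent ranking (best first).
rankings = [
    ("safety", "autonomy", "helpfulness"),
    ("autonomy", "helpfulness", "safety"),
    ("helpfulness", "safety", "autonomy"),
]

def majority_prefers(a, b):
    """True if a strict majority of stakeholders rank a above b."""
    votes = sum(r.index(a) < r.index(b) for r in rankings)
    return votes > len(rankings) / 2

options = rankings[0]
# A collective ranking is coherent only if each option beats every option placed below it.
coherent = [
    order for order in permutations(options)
    if all(majority_prefers(order[i], order[j])
           for i in range(3) for j in range(i + 1, 3))
]
print(coherent)
```

Every one of the six possible group rankings is contradicted by at least one majority vote, so there is nothing to converge on. This is the situation in which accommodating the three rankings separately becomes the practical option.
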
### Additional Evidence (extend)
*Source: 2024-02-00-chakraborty-maxmin-rlhf | Added: 2026-03-15 | Extractor: anthropic/claude-sonnet-4.5*
MaxMin-RLHF provides a constructive implementation of pluralistic alignment through mixture-of-rewards and egalitarian optimization. Rather than forcing preferences to converge, it learns a separate reward model for each subpopulation and optimizes for the worst-off group (Sen's Egalitarian principle). At Tulu2-7B scale, this achieved a 56.67% win rate across both majority and minority groups, compared to the single-reward baseline's 70.4%/42% split across those groups. The mechanism accommodates irreducible diversity by maintaining separate reward functions rather than collapsing them into one.
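
The egalitarian objective can be contrasted with a pooled single-reward objective in a toy setting. The sketch below assumes two groups with already-learned reward scores over three candidate responses. The numbers and names are illustrative, and the code does not reproduce the MaxMin-RLHF training procedure, which optimizes a policy rather than picking from a fixed list.

```python
# Hypothetical per-group reward scores for three candidate responses.
# The population weights make "majority" dominate a pooled, single-reward view.
GROUP_WEIGHT = {"majority": 0.8, "minority": 0.2}
REWARDS = {
    "response_a": {"majority": 0.9, "minority": 0.1},
    "response_b": {"majority": 0.7, "minority": 0.6},
    "response_c": {"majority": 0.4, "minority": 0.8},
}

def pooled_reward(scores):
    """Single-reward view: population-weighted average across groups."""
    return sum(GROUP_WEIGHT[g] * scores[g] for g in GROUP_WEIGHT)

def worst_group_reward(scores):
    """Egalitarian view: the reward of the worst-off group."""
    return min(scores.values())

best_pooled = max(REWARDS, key=lambda r: pooled_reward(REWARDS[r]))
best_maxmin = max(REWARDS, key=lambda r: worst_group_reward(REWARDS[r]))
print("pooled objective picks:", best_pooled)
print("max-min objective picks:", best_maxmin)
```

With these assumed scores the pooled objective favors the response the minority rates lowest, while the max-min objective picks the response that is acceptable to both groups.
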
### Additional Evidence (confirm)
*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*
EM-DPO implements this through an ensemble architecture: it discovers K latent preference types, trains K specialized models, and deploys them simultaneously with egalitarian aggregation. This demonstrates that pluralistic alignment is technically feasible without requiring demographic labels or manual preference specification.

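
The discovery of latent preference types without labels can be sketched with a much simpler model than EM-DPO itself. The code below swaps in a plain EM fit of a two-type Bernoulli mixture over annotators' pairwise choices. It fixes K = 2, trains no policies, and uses made-up data, so it illustrates only the clustering idea, not the EM-DPO algorithm.

```python
def fit_preference_types(choices, n_iters=20):
    """EM for a two-type mixture of Bernoulli choice profiles (toy stand-in, not EM-DPO)."""
    n, m = len(choices), len(choices[0])
    # Initialise type 0 from the first annotator and type 1 from the most dissimilar one.
    far = max(range(n), key=lambda i: sum(a != b for a, b in zip(choices[0], choices[i])))
    theta = [[0.75 if c else 0.25 for c in choices[idx]] for idx in (0, far)]
    mix = [0.5, 0.5]
    for _ in range(n_iters):
        # E-step: posterior probability of each latent type for each annotator.
        resp = []
        for row in choices:
            lik = []
            for k in range(2):
                p = mix[k]
                for c, t in zip(row, theta[k]):
                    p *= t if c else 1 - t
                lik.append(p)
            total = sum(lik)
            resp.append([l / total for l in lik])
        # M-step: re-estimate mixing weights and per-item choice rates (Laplace-smoothed).
        for k in range(2):
            weight = sum(r[k] for r in resp)
            mix[k] = weight / n
            theta[k] = [
                (sum(r[k] * row[j] for r, row in zip(resp, choices)) + 1) / (weight + 2)
                for j in range(m)
            ]
    return resp, theta, mix

# Hypothetical data: one row per annotator, one column per pairwise comparison
# (1 = preferred response A, 0 = preferred response B). No demographic labels are used.
choices = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
resp, theta, mix = fit_preference_types(choices)
types = [max(range(2), key=lambda k: r[k]) for r in resp]
print(types)
```
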
---
Relevant Notes:
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- the technical failure that motivates pluralistic alternatives
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] -- pluralistic alignment is the practical response to this impossibility
- [[collective intelligence requires diversity as a structural precondition not a moral preference]] -- imports this insight into alignment: diversity preserved, not averaged
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] -- pluralism plus temporal adaptation addresses the specification trap
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] -- assemblies are one mechanism for pluralistic alignment
Topics:
- [[_map]]