
---
confidence: likely
created: 2026-02-17
description: The dominant alignment paradigms share a core limitation -- human preferences are diverse distributional and context-dependent not reducible to one reward function
domain: collective-intelligence
related:
  - rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training
  - rlhf-is-implicit-social-choice-without-normative-scrutiny
  - the variance of a learned preference sensitivity distribution diagnoses dataset heterogeneity and collapses to fixed-parameter behavior when preferences are homogeneous
  - learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want
  - sycophancy-is-paradigm-level-failure-across-all-frontier-models-suggesting-rlhf-systematically-produces-approval-seeking
  - large language models encode social intelligence as compressed cultural ratchet not abstract reasoning because every parameter is a residue of communicative exchange and reasoning manifests as multi-perspective dialogue not calculation
  - collective-intelligence-architectures-are-underexplored-for-alignment-despite-addressing-core-problems
  - futarchy-conditional-markets-aggregate-information-through-financial-stake-not-voting-participation
reweave_edges:
  - rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training|related|2026-03-28
  - rlhf-is-implicit-social-choice-without-normative-scrutiny|related|2026-03-28
  - single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness|supports|2026-03-28
  - the variance of a learned preference sensitivity distribution diagnoses dataset heterogeneity and collapses to fixed-parameter behavior when preferences are homogeneous|related|2026-03-28
  - learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want|related|2026-04-06
  - sycophancy-is-paradigm-level-failure-across-all-frontier-models-suggesting-rlhf-systematically-produces-approval-seeking|related|2026-04-17
  - large language models encode social intelligence as compressed cultural ratchet not abstract reasoning because every parameter is a residue of communicative exchange and reasoning manifests as multi-perspective dialogue not calculation|related|2026-04-17
  - Collective intelligence architectures are structurally underexplored for alignment despite directly addressing preference diversity value evolution and scalable oversight|supports|2026-04-19
source: DPO Survey 2025 (arXiv 2503.11701)
supports:
  - single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness
  - Collective intelligence architectures are structurally underexplored for alignment despite directly addressing preference diversity value evolution and scalable oversight
type: claim
---

RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values

RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) are the two dominant alignment paradigms as of 2025. RLHF trains a reward model on human preference rankings, then optimizes the language model against it with reinforcement learning. DPO eliminates the explicit reward model, using the policy itself as an implicit reward function and reducing alignment to a single supervised classification objective, which makes it substantially more computationally tractable.
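The mechanical difference can be made concrete with the per-pair DPO loss from the standard Bradley-Terry derivation. A minimal sketch; the log-probabilities below are illustrative numbers, not real model outputs, and `beta` is the usual KL-strength hyperparameter:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    logp_*     : policy log-prob of the chosen (w) / rejected (l) response
    ref_logp_* : same quantities under the frozen reference model
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy already favors the chosen response more strongly than the
# reference does, the margin is positive and the loss drops below log(2).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # margin = +2, loss < log(2)
```

No reward model appears anywhere: the policy/reference log-ratio plays that role, which is exactly why DPO inherits RLHF's single-reward-function assumption rather than escaping it.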

But both share a fundamental limitation: they implicitly assume human preferences can be accurately captured by a single reward function. In reality, human preferences are diverse, context-dependent, and distributional. A comprehensive 2025 survey (arXiv 2503.11701) identifies four evolving dimensions of DPO research -- data strategy, learning framework, constraint mechanism, and model property -- yet none addresses the core representational inadequacy. When preferences genuinely conflict between populations, a single reward function cannot represent both without distortion. This is not merely a practical limitation but a mathematical one: Arrow's and Sen's impossibility theorems prove formally that no procedure for aggregating diverse human preferences into a single coherent objective can satisfy minimal fairness criteria while faithfully representing those preferences, so universal alignment through a single reward function is impossible in principle.
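The distortion is easy to exhibit with the Bradley-Terry preference model that underlies both RLHF reward modeling and DPO. A toy illustration (the populations and win rates here are invented for the example, not taken from the survey):

```python
import math

def bt_reward_gap(win_rate):
    """Bradley-Terry MLE reward gap r1 - r2 implied by an observed win rate."""
    return math.log(win_rate / (1.0 - win_rate))

# Two equally sized populations with opposite, strongly held preferences:
group_a = 0.9   # P(prefers y1 over y2) in group A
group_b = 0.1   # P(prefers y1 over y2) in group B
pooled  = 0.5 * group_a + 0.5 * group_b  # what a single reward model sees

print(bt_reward_gap(group_a))  # strong positive gap: y1 clearly preferred
print(bt_reward_gap(group_b))  # strong negative gap: y2 clearly preferred
print(bt_reward_gap(pooled))   # 0.0: the single reward model calls it a tie
```

The pooled fit is wrong for everyone: it reports indifference where both groups hold strong, opposite preferences, which is the "distortion" in miniature.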

This is precisely the gap that collective intelligence approaches could fill. Since specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception, compressing diverse human preferences into one function is a special case of the specification problem. And since collective intelligence requires diversity as a structural precondition, not a moral preference, a collective alignment architecture could preserve preference diversity structurally rather than flattening it into a single reward signal.
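One hypothetical way to read "preserve preference diversity structurally" operationally, echoing the related preference-sensitivity-variance claim: keep group-conditional reward estimates instead of a pooled scalar, and treat their spread as a heterogeneity diagnostic. All function names and numbers below are illustrative, not from any cited system:

```python
import math
import statistics

def groupwise_rewards(prefs_by_group):
    """Per-group Bradley-Terry reward gaps plus their variance.

    Near-zero variance means a single reward function would suffice;
    high variance means pooling into one scalar destroys information.
    """
    gaps = {g: math.log(p / (1.0 - p)) for g, p in prefs_by_group.items()}
    return gaps, statistics.pvariance(gaps.values())

# Hypothetical group-level rates of preferring response y1 over y2:
_, var_homog = groupwise_rewards({"a": 0.72, "b": 0.70, "c": 0.71})
_, var_split = groupwise_rewards({"a": 0.90, "b": 0.10, "c": 0.50})
print(var_homog < var_split)  # True: variance flags genuine heterogeneity
```

The point of the sketch is only that the representation retains the conflict (as spread across groups) instead of averaging it away.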

Constitutional AI (Anthropic) partially addresses this by training on principles rather than preference rankings, but the constitution must still be written before training -- it cannot evolve with the values it encodes. The entire paradigm of "align once during training" is what the continuous value-weaving thesis challenges.


Relevant Notes:

Additional Evidence (extend)

Source: 2026-03-21-evans-bratton-aguera-agentic-ai-intelligence-explosion | Added: 2026-04-14 | Extractor: theseus | Contributor: @thesensatore (Telegram)

Evans, Bratton & Agüera y Arcas (2026) identify a deeper structural problem with RLHF beyond preference diversity: it is a "dyadic parent-child correction model" that cannot scale to governing billions of agents. The correction model assumes one human correcting one model — a relationship that breaks at institutional scale just as it breaks under preference diversity. Their alternative — institutional alignment through persistent role-based templates (courtrooms, markets, bureaucracies) — provides governance through structural constraints rather than individual correction. This parallels Ostrom's design principles: successful commons governance emerges from architectural properties (boundaries, monitoring, graduated sanctions), not from correcting individual behavior. Since reasoning models spontaneously generate societies of thought under reinforcement learning because multi-perspective internal debate causally produces accuracy gains that single-perspective reasoning cannot achieve, RLHF's dyadic model is additionally inadequate: it treats a model that internally functions as a society as if it were a single agent to be corrected.

Topics: