teleo-codex/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md

---
type: claim
domain: ai-alignment
description: "Current RLHF implementations make social choice decisions about evaluator selection and preference aggregation without examining their normative properties"
confidence: likely
source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)"
created: 2026-03-11
---

# RLHF is implicit social choice without normative scrutiny

Reinforcement Learning from Human Feedback (RLHF) necessarily makes social choice decisions: which humans provide input, what feedback is collected, how it is aggregated, and how it is used. Current implementations make these choices without examining their normative properties or drawing on more than 70 years of social choice theory.

Conitzer et al. (2024) argue that RLHF practitioners implicitly answer fundamental social choice questions: Who gets to evaluate? How are conflicting preferences weighted? What aggregation method combines diverse judgments? These decisions have profound implications for whose values shape AI behavior, yet they're typically made based on convenience (e.g., using readily available crowdworker platforms) rather than principled normative reasoning.

The paper argues that social choice theory since Arrow has developed practical mechanisms that work within the constraints of Arrow's impossibility theorem. On this reading, RLHF has reinvented preference aggregation poorly, ignoring decades of formal work on voting methods, welfare functions, and pluralistic decision-making.
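To make the aggregation question concrete, here is a minimal sketch of how two textbook voting rules can pick different winners from the same evaluator rankings. The rankings, the response labels `A`/`B`/`C`, and the function names are hypothetical illustrations, not data or code from Conitzer et al. (2024).

```python
from collections import Counter

# Hypothetical evaluator rankings over three candidate responses (best first).
rankings = (
    [("A", "B", "C")] * 4
    + [("B", "C", "A")] * 3
    + [("C", "B", "A")] * 2
)

def plurality_winner(rankings):
    """Winner by count of first-place votes (ties broken alphabetically)."""
    firsts = Counter(r[0] for r in rankings)
    return max(sorted(firsts), key=lambda c: firsts[c])

def borda_winner(rankings):
    """Winner by Borda count: m-1 points for first place, down to 0 for last."""
    scores = Counter()
    for r in rankings:
        m = len(r)
        for pos, cand in enumerate(r):
            scores[cand] += m - 1 - pos
    return max(sorted(scores), key=lambda c: scores[c])

print(plurality_winner(rankings))  # A
print(borda_winner(rankings))      # B
```

With identical input, plurality selects `A` while Borda selects `B`, so choosing an aggregation rule is itself a normative decision rather than an implementation detail.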

## Evidence

- Conitzer et al. (2024) position paper at ICML 2024, co-authored by Stuart Russell (Berkeley CHAI) and leading social choice theorists
- Current RLHF uses convenience sampling (crowdworker platforms) rather than representative sampling or deliberative mechanisms
- The paper proposes RLCHF (Reinforcement Learning from Collective Human Feedback) as the formal alternative that makes social choice decisions explicit

## Relationship to Existing Work

This claim directly addresses the mechanism gap identified in [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]. Where that claim focuses on the technical failure mode (single reward function), this claim identifies the root cause: RLHF makes social choice decisions without social choice theory.

The paper's proposed solution—RLCHF with explicit social welfare functions—connects to [[collective intelligence requires diversity as a structural precondition not a moral preference]] by formalizing how diverse evaluator input should be preserved rather than collapsed.
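As a rough illustration of what an explicit social welfare function involves, the sketch below scores two candidate responses under three standard welfare functions (utilitarian sum, egalitarian minimum, Nash product). The utility numbers, group structure, and names such as `response_x` are assumptions made for the example; the paper does not prescribe these particular values or this code.

```python
import math

# Hypothetical per-group utilities for two candidate responses.
# Each list holds the utility assigned by three evaluator groups.
utilities = {
    "response_x": [0.9, 0.9, 0.1],  # strongly preferred by two groups, disliked by one
    "response_y": [0.6, 0.6, 0.6],  # moderately acceptable to every group
}

# Three textbook social welfare functions.
welfare_functions = {
    "utilitarian": sum,   # total utility
    "egalitarian": min,   # utility of the worst-off group
    "nash": math.prod,    # product of utilities
}

def best_response(utilities, welfare):
    """Return the response that maximizes the given welfare function."""
    return max(utilities, key=lambda r: welfare(utilities[r]))

choices = {name: best_response(utilities, fn) for name, fn in welfare_functions.items()}
print(choices)
```

In this toy setup the utilitarian rule favors `response_x`, while the egalitarian and Nash rules favor `response_y`, which shows how a minority group's preferences are either preserved or collapsed depending on the welfare function chosen.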


### Additional Evidence (extend)
*Source: 2025-06-00-li-scaling-human-judgment-community-notes-llms | Added: 2026-03-15*

RLCF (Reinforcement Learning from Community Feedback) makes the social choice mechanism explicit through the bridging algorithm (matrix factorization with intercept scores). Unlike standard RLHF, which aggregates preferences opaquely through reward model training, RLCF uses the intercepts as the training signal. This is a deliberate choice to optimize for cross-partisan agreement, which amounts to a specific social welfare function.
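The idea behind a bridging-style factorization can be sketched on toy data. The example below assumes the general model form publicly documented for Community Notes, rating ≈ global mean + rater intercept + note intercept + rater factor × note factor, fitted by plain gradient descent. The ratings matrix, hyperparameters, and the function name `fit_bridging` are illustrative assumptions; this is not the production algorithm or the RLCF training pipeline.

```python
import numpy as np

# Hypothetical ratings: rows are raters, columns are notes (1 = helpful).
# Raters 0-1 and raters 2-3 form two camps that disagree on notes 1 and 2,
# while every rater finds note 0 helpful.
ratings = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
], dtype=float)

def fit_bridging(ratings, steps=5000, lr=0.05, lam_i=0.15, lam_f=0.03, seed=0):
    """Fit rating ~ mu + rater_int + note_int + rater_fac * note_fac.

    Minimizes mean squared error plus L2 penalties (lam_i on intercepts,
    lam_f on factors) with full-batch gradient descent.
    """
    rng = np.random.default_rng(seed)
    n_raters, n_notes = ratings.shape
    n = ratings.size
    mu = 0.0
    rater_int = np.zeros(n_raters)
    note_int = np.zeros(n_notes)
    rater_fac = rng.normal(scale=0.1, size=n_raters)
    note_fac = rng.normal(scale=0.1, size=n_notes)
    for _ in range(steps):
        pred = mu + rater_int[:, None] + note_int[None, :] + np.outer(rater_fac, note_fac)
        err = pred - ratings
        # Gradients of the regularized mean squared error.
        g_mu = 2 * err.sum() / n + 2 * lam_i * mu
        g_ri = 2 * err.sum(axis=1) / n + 2 * lam_i * rater_int
        g_ni = 2 * err.sum(axis=0) / n + 2 * lam_i * note_int
        g_rf = 2 * (err @ note_fac) / n + 2 * lam_f * rater_fac
        g_nf = 2 * (err.T @ rater_fac) / n + 2 * lam_f * note_fac
        mu -= lr * g_mu
        rater_int -= lr * g_ri
        note_int -= lr * g_ni
        rater_fac -= lr * g_rf
        note_fac -= lr * g_nf
    return mu, rater_int, note_int, rater_fac, note_fac

note_int = fit_bridging(ratings)[2]
print(note_int)  # note 0 should receive the largest intercept
```

The note intercept is the quantity of interest: note 0, rated helpful by both camps, ends up with a higher intercept than notes 1 and 2, whose support comes from one camp only. Using that intercept as a training signal therefore rewards agreement across camps rather than raw popularity.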


### Additional Evidence (confirm)
*Source: [[2026-02-00-an-differentiable-social-choice]] | Added: 2026-03-16*

A comprehensive February 2026 survey by An & Du documents that contemporary ML systems implement social choice mechanisms implicitly across RLHF, participatory budgeting, and liquid democracy applications, and identifies 18 open problems spanning incentive guarantees and pluralistic preference aggregation.

---

Relevant Notes:
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
- [[collective intelligence requires diversity as a structural precondition not a moral preference]]
- [[AI alignment is a coordination problem not a technical problem]]

Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map
- foundations/collective-intelligence/_map