| type | domain | description | confidence | source | created |
|---|---|---|---|---|---|
| claim | ai-alignment | Current RLHF implementations make social choice decisions about evaluator selection and preference aggregation without examining their normative properties | likely | Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback' (ICML 2024) | 2026-03-11 |
RLHF is implicit social choice without normative scrutiny
Reinforcement Learning from Human Feedback (RLHF) necessarily makes social choice decisions—which humans provide input, what feedback is collected, how it's aggregated, and how it's used—but current implementations make these choices without examining their normative properties or drawing on 70+ years of social choice theory.
Conitzer et al. (2024) argue that RLHF practitioners implicitly answer fundamental social choice questions: Who gets to evaluate? How are conflicting preferences weighted? What aggregation method combines diverse judgments? These decisions have profound implications for whose values shape AI behavior, yet they're typically made based on convenience (e.g., using readily available crowdworker platforms) rather than principled normative reasoning.
The paper argues that post-Arrow social choice theory has developed practical mechanisms that work within Arrow's impossibility constraints. In this framing, RLHF re-derives preference aggregation ad hoc, ignoring decades of formal work on voting rules, social welfare functions, and pluralistic decision-making.
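The claim that the choice of aggregation rule is itself consequential can be made concrete with a toy preference profile on which standard voting rules disagree. The profile and rule implementations below are illustrative, not from the paper:

```python
# Toy preference profile over three candidate outputs (illustrative values).
# Each entry: (number of evaluators, ranking from best to worst).
profile = [
    (3, ["A", "B", "C"]),
    (2, ["B", "C", "A"]),
    (2, ["C", "B", "A"]),
]
candidates = ["A", "B", "C"]

def plurality(profile):
    """Winner = candidate ranked first by the most evaluators."""
    scores = {c: 0 for c in candidates}
    for count, ranking in profile:
        scores[ranking[0]] += count
    return max(scores, key=scores.get)

def borda(profile):
    """Winner = candidate with the highest total Borda score."""
    scores = {c: 0 for c in candidates}
    for count, ranking in profile:
        for pos, c in enumerate(ranking):
            scores[c] += count * (len(candidates) - 1 - pos)
    return max(scores, key=scores.get)

def condorcet(profile):
    """Candidate beating every rival in pairwise majority contests, if any."""
    def prefer(x, y):
        return sum(ct for ct, r in profile if r.index(x) < r.index(y))
    for c in candidates:
        if all(prefer(c, d) > prefer(d, c) for d in candidates if d != c):
            return c
    return None  # no Condorcet winner exists

print(plurality(profile), borda(profile), condorcet(profile))  # A B B
```

Identical evaluator data yields different "collective preferences" depending on the rule: plurality crowns A, while Borda and pairwise majority both pick B. This is exactly the kind of design decision an RLHF pipeline makes implicitly when it chooses a loss over pairwise comparisons.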
Evidence
- Conitzer et al. (2024) position paper at ICML 2024, co-authored by Stuart Russell (Berkeley CHAI) and leading social choice theorists
- Current RLHF uses convenience sampling (crowdworker platforms) rather than representative sampling or deliberative mechanisms
- The paper proposes RLCHF (Reinforcement Learning from Collective Human Feedback) as the formal alternative that makes social choice decisions explicit
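To illustrate what making social choice decisions explicit could mean in practice, here is a minimal sketch, not the paper's formalism: the reward values, response names, and the two welfare functions are assumed for illustration. It shows how the choice of social welfare function changes which response an aligned model would be trained to prefer:

```python
# Hypothetical per-evaluator reward estimates for two candidate responses.
# Each list holds one scalar reward per evaluator (values assumed).
rewards = {
    "resp_A": [1.0, 1.0, 0.0],  # excellent for two evaluators, useless for one
    "resp_B": [0.6, 0.6, 0.6],  # moderately good for everyone
}

def utilitarian(rs):
    """Average welfare: maximize the mean evaluator reward."""
    return sum(rs) / len(rs)

def egalitarian(rs):
    """Rawlsian maximin: maximize the worst-off evaluator's reward."""
    return min(rs)

best_utilitarian = max(rewards, key=lambda r: utilitarian(rewards[r]))
best_egalitarian = max(rewards, key=lambda r: egalitarian(rewards[r]))
print(best_utilitarian, best_egalitarian)  # resp_A resp_B
```

A pipeline that averages feedback has silently chosen the utilitarian column; a social-choice-informed pipeline would name the welfare function and defend it.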
Relationship to Existing Work
This claim directly addresses the mechanism gap identified in "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values". Where that note focuses on the technical failure mode (a single reward function), this claim identifies the root cause: RLHF makes social choice decisions without social choice theory.
The paper's proposed solution, explicit social welfare functions, connects to "collective intelligence requires diversity as a structural precondition not a moral preference" by formalizing how diverse evaluator input should be preserved rather than collapsed.
Relevant Notes:
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
- collective intelligence requires diversity as a structural precondition not a moral preference
- AI alignment is a coordination problem not a technical problem
Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map
- foundations/collective-intelligence/_map