teleo-codex/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md

| type | domain | description | confidence | source | created |
| --- | --- | --- | --- | --- | --- |
| claim | ai-alignment | Current RLHF implementations make social choice decisions about evaluator selection and preference aggregation without examining their normative properties | likely | Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024) | 2026-03-11 |

RLHF is implicit social choice without normative scrutiny

Reinforcement Learning from Human Feedback (RLHF) necessarily makes social choice decisions—which humans provide input, what feedback is collected, how it's aggregated, and how it's used—but current implementations make these choices without examining their normative properties or drawing on 70+ years of social choice theory.

Conitzer et al. (2024) argue that RLHF practitioners implicitly answer fundamental social choice questions: Who gets to evaluate? How are conflicting preferences weighted? What aggregation method combines diverse judgments? These decisions have profound implications for whose values shape AI behavior, yet they're typically made based on convenience (e.g., using readily available crowdworker platforms) rather than principled normative reasoning.
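The stakes of these implicit choices can be made concrete with a minimal sketch (hypothetical ratings, not from the paper): the same evaluator feedback selects different winners depending on whether it is aggregated by averaging scores or by head-to-head majority vote.

```python
# Hypothetical scores from four evaluators for two candidate responses.
# Three evaluators mildly prefer A; one strongly prefers B.
ratings = {
    "A": [4, 4, 4, 1],
    "B": [3, 3, 3, 5],
}

def mean_winner(r):
    # Aggregate by average score (the implicit choice in reward modeling).
    return max(r, key=lambda k: sum(r[k]) / len(r[k]))

def majority_winner(r):
    # Aggregate by head-to-head majority vote across evaluators.
    votes_a = sum(a > b for a, b in zip(r["A"], r["B"]))
    votes_b = sum(b > a for a, b in zip(r["A"], r["B"]))
    return "A" if votes_a > votes_b else "B"

print(mean_winner(ratings))      # B (mean 3.5 vs 3.25)
print(majority_winner(ratings))  # A (3 evaluators to 1)
```

Neither answer is wrong; they encode different normative commitments (intensity-sensitive utilitarianism vs. one-person-one-vote), which is exactly the kind of decision the paper argues should be made explicitly.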

The paper demonstrates that post-Arrow social choice theory has developed practical mechanisms that work within Arrow's impossibility constraints. RLHF essentially reinvented preference aggregation badly, ignoring decades of formal work on voting methods, welfare functions, and pluralistic decision-making.
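A toy illustration of why this formal work matters (hypothetical evaluator rankings, not the paper's example): when evaluator groups' rankings form a Condorcet cycle, naive pairwise majority voting yields no coherent winner, while a positional rule like Borda count still produces well-defined scores.

```python
# Three hypothetical evaluator groups rank three candidate responses.
# These rankings form a classic Condorcet cycle.
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def pairwise_majority(x, y):
    # True if a majority of groups rank x above y.
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

# Each candidate beats one rival and loses to another: majority
# preference is cyclic, so "pick the majority winner" is ill-defined.
print(pairwise_majority("A", "B"))  # True
print(pairwise_majority("B", "C"))  # True
print(pairwise_majority("C", "A"))  # True

def borda(rankings):
    # Borda count: a candidate in position p of an n-item ranking
    # earns n - 1 - p points; totals are always well-defined.
    n = len(rankings[0])
    scores = {}
    for r in rankings:
        for pos, cand in enumerate(r):
            scores[cand] = scores.get(cand, 0) + (n - 1 - pos)
    return scores

print(borda(rankings))  # symmetric tie here: {'A': 3, 'B': 3, 'C': 3}
```

The cycle is precisely the pathology Arrow's theorem formalizes; post-Arrow mechanisms do not escape the impossibility result but characterize which trade-offs each rule makes, knowledge RLHF pipelines currently leave on the table.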

Evidence

  • Conitzer et al. (2024) position paper at ICML 2024, co-authored by Stuart Russell (Berkeley CHAI) and leading social choice theorists
  • Current RLHF uses convenience sampling (crowdworker platforms) rather than representative sampling or deliberative mechanisms
  • The paper proposes RLCHF (Reinforcement Learning from Collective Human Feedback) as the formal alternative that makes social choice decisions explicit


Relationship to Existing Work

This claim directly addresses the mechanism gap identified in "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values". Where that claim focuses on the technical failure mode (the single-reward-function assumption), this claim identifies the root cause: RLHF makes social choice decisions without social choice theory.

The paper's proposed solution—RLCHF with explicit social welfare functions—connects to "collective intelligence requires diversity as a structural precondition not a moral preference" by formalizing how diverse evaluator input should be preserved rather than collapsed.
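How an explicit social welfare function preserves rather than collapses diversity can be sketched with hypothetical group utilities (an illustration, not the paper's formalism): utilitarian aggregation maximizes the sum even when one group is nearly zeroed out, while Nash welfare (the product of utilities) heavily penalizes ignoring any group.

```python
import math

# Hypothetical utilities of three evaluator groups under two policies.
# Policy "u" maximizes total utility but nearly zeroes out group 3;
# policy "n" gives every group a moderate utility.
utilities = {
    "u": [9.0, 9.0, 0.1],
    "n": [5.0, 5.0, 5.0],
}

def utilitarian(u):
    # Sum of utilities: insensitive to how utility is distributed.
    return sum(u)

def nash_welfare(u):
    # Product of utilities: driven toward zero if any group gets ~nothing.
    return math.prod(u)

best_util = max(utilities, key=lambda k: utilitarian(utilities[k]))
best_nash = max(utilities, key=lambda k: nash_welfare(utilities[k]))
print(best_util, best_nash)  # u n
```

The welfare function names here are standard social choice terminology; which function RLCHF should use is exactly the normative question the paper argues must be answered explicitly rather than by default.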


Relevant Notes:

Topics:

  • domains/ai-alignment/_map
  • core/mechanisms/_map
  • foundations/collective-intelligence/_map