| type | domain | description | confidence | source | created |
|---|---|---|---|---|---|
| claim | ai-alignment | Current RLHF implementations make social choice decisions about evaluator selection and preference aggregation without examining their normative properties | likely | Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback' (ICML 2024) | 2026-03-11 |
RLHF is implicit social choice without normative scrutiny
Reinforcement Learning from Human Feedback (RLHF) necessarily makes social choice decisions—which humans provide input, what feedback is collected, how it's aggregated, and how it's used—but current implementations make these choices without examining their normative properties or drawing on 70+ years of social choice theory.
Conitzer et al. (2024) argue that RLHF practitioners implicitly answer fundamental social choice questions: Who gets to evaluate? How are conflicting preferences weighted? What aggregation method combines diverse judgments? These decisions have profound implications for whose values shape AI behavior, yet they're typically made based on convenience (e.g., using readily available crowdworker platforms) rather than principled normative reasoning.
The paper argues that post-Arrow social choice theory has developed practical mechanisms that work within Arrow's impossibility constraints. In this framing, RLHF re-derives preference aggregation ad hoc, ignoring decades of formal work on voting rules, social welfare functions, and pluralistic decision-making.
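The claim that the choice of aggregation rule is itself consequential can be made concrete with a toy preference profile on which standard voting rules disagree. The profile and rule implementations below are illustrative, not from the paper:

```python
# Toy preference profile over three candidate outputs (illustrative values).
# Each entry: (number of evaluators, ranking from best to worst).
profile = [
    (3, ["A", "B", "C"]),
    (2, ["B", "C", "A"]),
    (2, ["C", "B", "A"]),
]
candidates = ["A", "B", "C"]

def plurality(profile):
    """Winner = candidate ranked first by the most evaluators."""
    scores = {c: 0 for c in candidates}
    for count, ranking in profile:
        scores[ranking[0]] += count
    return max(scores, key=scores.get)

def borda(profile):
    """Winner = candidate with the highest total Borda score."""
    scores = {c: 0 for c in candidates}
    for count, ranking in profile:
        for pos, c in enumerate(ranking):
            scores[c] += count * (len(candidates) - 1 - pos)
    return max(scores, key=scores.get)

def condorcet(profile):
    """Candidate beating every rival in pairwise majority contests, if any."""
    def prefer(x, y):
        return sum(ct for ct, r in profile if r.index(x) < r.index(y))
    for c in candidates:
        if all(prefer(c, d) > prefer(d, c) for d in candidates if d != c):
            return c
    return None  # no Condorcet winner exists

print(plurality(profile), borda(profile), condorcet(profile))  # A B B
```

Identical evaluator data yields different "collective preferences" depending on the rule: plurality crowns A, while Borda and pairwise majority both pick B. This is exactly the kind of design decision an RLHF pipeline makes implicitly when it chooses a loss over pairwise comparisons.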
Evidence
- Conitzer et al. (2024) position paper at ICML 2024, co-authored by Stuart Russell (Berkeley CHAI) and leading social choice theorists
- Current RLHF uses convenience sampling (crowdworker platforms) rather than representative sampling or deliberative mechanisms
- The paper proposes RLCHF (Reinforcement Learning from Collective Human Feedback) as the formal alternative that makes social choice decisions explicit
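To illustrate what making social choice decisions explicit could mean in practice, here is a minimal sketch, not the paper's formalism: the reward values, response names, and the two welfare functions are assumed for illustration. It shows how the choice of social welfare function changes which response an aligned model would be trained to prefer:

```python
# Hypothetical per-evaluator reward estimates for two candidate responses.
# Each list holds one scalar reward per evaluator (values assumed).
rewards = {
    "resp_A": [1.0, 1.0, 0.0],  # excellent for two evaluators, useless for one
    "resp_B": [0.6, 0.6, 0.6],  # moderately good for everyone
}

def utilitarian(rs):
    """Average welfare: maximize the mean evaluator reward."""
    return sum(rs) / len(rs)

def egalitarian(rs):
    """Rawlsian maximin: maximize the worst-off evaluator's reward."""
    return min(rs)

best_utilitarian = max(rewards, key=lambda r: utilitarian(rewards[r]))
best_egalitarian = max(rewards, key=lambda r: egalitarian(rewards[r]))
print(best_utilitarian, best_egalitarian)  # resp_A resp_B
```

A pipeline that averages feedback has silently chosen the utilitarian column; a social-choice-informed pipeline would name the welfare function and defend it.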
Relationship to Existing Work
This claim directly addresses the mechanism gap identified in "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values". Where that note focuses on the technical failure mode (a single reward function), this claim identifies the root cause: RLHF makes social choice decisions without social choice theory.
The paper's proposed solution, explicit social welfare functions, connects to "collective intelligence requires diversity as a structural precondition not a moral preference" by formalizing how diverse evaluator input should be preserved rather than collapsed.
Relevant Notes:
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
- collective intelligence requires diversity as a structural precondition not a moral preference
- AI alignment is a coordination problem not a technical problem
Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map
- foundations/collective-intelligence/_map