Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
| type | domain | secondary_domains | description | confidence | source | created |
|---|---|---|---|---|---|---|
| claim | ai-alignment | | The aggregated rankings variant of RLCHF applies formal social choice functions to combine multiple evaluator rankings before training the reward model | experimental | Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024) | 2026-03-11 |
RLCHF aggregated rankings variant combines evaluator rankings via social welfare function before reward model training
Conitzer et al. (2024) propose Reinforcement Learning from Collective Human Feedback (RLCHF) as a formalization of preference aggregation in AI alignment. The aggregated rankings variant works by: (1) collecting rankings of AI responses from multiple evaluators, (2) combining these rankings using a formal social welfare function (e.g., Borda Count, Ranked Pairs), (3) training the reward model on the aggregated ranking rather than individual preferences.
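The three steps above can be sketched with Borda Count as the social welfare function. A minimal illustration under assumed names and data, not an implementation from the paper:

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate evaluator rankings with Borda Count.

    Each ranking is a list of response IDs, best first. A response in
    position i of an n-item ranking scores n - 1 - i points; summing
    across evaluators yields one collective ranking.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, response in enumerate(ranking):
            scores[response] += n - 1 - position
    # Sort by total score, highest first (ties broken alphabetically).
    return sorted(scores, key=lambda r: (-scores[r], r))

# Step 1: three evaluators each rank four candidate responses.
rankings = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["A", "C", "B", "D"],
]
# Steps 2-3: aggregate, then train the reward model on this ranking.
print(borda_aggregate(rankings))  # ['A', 'B', 'C', 'D']
```

The reward model then sees only the aggregated ranking, never the individual evaluator rankings.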
This approach makes the social choice decision explicit and auditable. Instead of aggregating implicitly through dataset composition or reward model averaging, the aggregation happens at the ranking level, using well-studied voting methods whose axiomatic properties (e.g., monotonicity, Condorcet consistency) are known.
The key architectural choice: aggregation happens before reward model training, not during or after. This means the reward model learns from a collective preference signal rather than trying to learn individual preferences and aggregate them internally.
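One simple way to feed the collective signal to a standard pairwise reward-model trainer is to expand the aggregated ranking into (chosen, rejected) examples. This Bradley-Terry-style expansion is an assumed sketch, not an API described in the paper:

```python
def ranking_to_pairs(ranking):
    """Expand one aggregated ranking into (chosen, rejected) training
    pairs: every response is preferred to every response ranked below
    it. The reward model trains on these pairs only, so it sees the
    collective preference, not any individual evaluator's ranking."""
    return [
        (ranking[i], ranking[j])
        for i in range(len(ranking))
        for j in range(i + 1, len(ranking))
    ]

pairs = ranking_to_pairs(["A", "B", "C"])
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
```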
Evidence
- Conitzer et al. (2024) describe two RLCHF variants; this is the first
- The paper recommends specific social welfare functions: Borda Count, Instant Runoff, Ranked Pairs
- This approach connects to 70+ years of social choice theory on voting methods
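Of the recommended functions, Instant Runoff is compact enough to sketch in full. A minimal illustration assuming complete rankings (every evaluator ranks every response); the function name is hypothetical:

```python
from collections import Counter

def instant_runoff(rankings):
    """Instant Runoff: repeatedly eliminate the response with the
    fewest first-place votes until a single response remains.
    Assumes every ranking lists all candidates."""
    remaining = set(rankings[0])
    while len(remaining) > 1:
        # Tally each evaluator's top choice among surviving responses.
        tallies = Counter(
            next(c for c in r if c in remaining) for r in rankings
        )
        # Eliminate the lowest tally (ties broken alphabetically).
        loser = min(remaining, key=lambda c: (tallies[c], c))
        remaining.discard(loser)
    return remaining.pop()

votes = [["A", "B", "C"]] * 2 + [["B", "A", "C"]] * 2 + [["C", "B", "A"]]
print(instant_runoff(votes))  # B: C is eliminated first, its vote transfers to B
```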
Comparison to Standard RLHF
Standard RLHF typically aggregates preferences implicitly through:
- Dataset composition (which evaluators are included)
- Majority voting on pairwise comparisons
- Averaging reward model predictions
RLCHF makes this aggregation explicit and allows practitioners to choose aggregation methods based on their normative properties rather than computational convenience.
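The choice of aggregation method is not cosmetic: on the same preference profile, majority voting on pairwise comparisons (standard RLHF) and Borda Count (one RLCHF option) can crown different winners. A small assumed example:

```python
# Five evaluators, three candidate responses.
profile = [("A", "B", "C")] * 3 + [("B", "C", "A")] * 2

def beats(x, y):
    """True if a strict majority of evaluators rank x above y."""
    return sum(r.index(x) < r.index(y) for r in profile) > len(profile) / 2

# Pairwise majority: A beats both B and C head-to-head (3 votes to 2).
majority_winner = next(
    c for c in "ABC" if all(beats(c, o) for o in "ABC" if o != c)
)

# Borda: B collects the most total points (B=7, A=6, C=2).
borda = {c: sum(len(r) - 1 - r.index(c) for r in profile) for c in "ABC"}
borda_winner = max(borda, key=borda.get)

print(majority_winner, borda_winner)  # A B
```

Because the winners differ, a practitioner must pick the method whose normative properties match their goals, which is exactly the decision RLCHF surfaces.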
Relationship to Existing Work
This mechanism directly addresses the failure mode described in the note 'RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values'. By aggregating at the ranking level with formal social choice functions, RLCHF preserves more information about preference diversity than collapsing everything into a single reward function.
The approach also connects to 'modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling'; both are attempts to handle preference heterogeneity more formally.
Relevant Notes:
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
- modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling
- post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives
Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map