Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
| type | domain | secondary_domains | description | confidence | source | created |
|---|---|---|---|---|---|---|
| claim | ai-alignment | | The aggregated rankings variant of RLCHF applies formal social choice functions to combine multiple evaluator rankings before training the reward model | experimental | Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024) | 2026-03-11 |
RLCHF aggregated rankings variant combines evaluator rankings via social welfare function before reward model training
Conitzer et al. (2024) propose Reinforcement Learning from Collective Human Feedback (RLCHF) as a formalization of preference aggregation in AI alignment. The aggregated rankings variant works by: (1) collecting rankings of AI responses from multiple evaluators, (2) combining these rankings using a formal social welfare function (e.g., Borda Count, Ranked Pairs), (3) training the reward model on the aggregated ranking rather than individual preferences.
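The three steps above can be sketched with Borda Count as the social welfare function. A minimal illustration under assumed names and data, not an implementation from the paper:

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate evaluator rankings with Borda Count.

    Each ranking is a list of response IDs, best first. A response in
    position i of an n-item ranking scores n - 1 - i points; summing
    across evaluators yields one collective ranking.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, response in enumerate(ranking):
            scores[response] += n - 1 - position
    # Sort by total score, highest first (ties broken alphabetically).
    return sorted(scores, key=lambda r: (-scores[r], r))

# Step 1: three evaluators each rank four candidate responses.
rankings = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["A", "C", "B", "D"],
]
# Steps 2-3: aggregate, then train the reward model on this ranking.
print(borda_aggregate(rankings))  # ['A', 'B', 'C', 'D']
```

The reward model then sees only the aggregated ranking, never the individual evaluator rankings.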
This approach makes the social choice decision explicit and auditable. Instead of aggregating implicitly through dataset composition or reward model averaging, the aggregation happens at the ranking level, using well-studied voting methods whose axiomatic properties (e.g., monotonicity, Condorcet consistency) are known.
The key architectural choice: aggregation happens before reward model training, not during or after. This means the reward model learns from a collective preference signal rather than trying to learn individual preferences and aggregate them internally.
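One simple way to feed the collective signal to a standard pairwise reward-model trainer is to expand the aggregated ranking into (chosen, rejected) examples. This Bradley-Terry-style expansion is an assumed sketch, not an API described in the paper:

```python
def ranking_to_pairs(ranking):
    """Expand one aggregated ranking into (chosen, rejected) training
    pairs: every response is preferred to every response ranked below
    it. The reward model trains on these pairs only, so it sees the
    collective preference, not any individual evaluator's ranking."""
    return [
        (ranking[i], ranking[j])
        for i in range(len(ranking))
        for j in range(i + 1, len(ranking))
    ]

pairs = ranking_to_pairs(["A", "B", "C"])
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
```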
Evidence
- Conitzer et al. (2024) describe two RLCHF variants; this is the first
- The paper recommends specific social welfare functions: Borda Count, Instant Runoff, Ranked Pairs
- This approach connects to 70+ years of social choice theory on voting methods
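Of the recommended functions, Instant Runoff is compact enough to sketch in full. A minimal illustration assuming complete rankings (every evaluator ranks every response); the function name is hypothetical:

```python
from collections import Counter

def instant_runoff(rankings):
    """Instant Runoff: repeatedly eliminate the response with the
    fewest first-place votes until a single response remains.
    Assumes every ranking lists all candidates."""
    remaining = set(rankings[0])
    while len(remaining) > 1:
        # Tally each evaluator's top choice among surviving responses.
        tallies = Counter(
            next(c for c in r if c in remaining) for r in rankings
        )
        # Eliminate the lowest tally (ties broken alphabetically).
        loser = min(remaining, key=lambda c: (tallies[c], c))
        remaining.discard(loser)
    return remaining.pop()

votes = [["A", "B", "C"]] * 2 + [["B", "A", "C"]] * 2 + [["C", "B", "A"]]
print(instant_runoff(votes))  # B: C is eliminated first, its vote transfers to B
```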
Comparison to Standard RLHF
Standard RLHF typically aggregates preferences implicitly through:
- Dataset composition (which evaluators are included)
- Majority voting on pairwise comparisons
- Averaging reward model predictions
RLCHF makes this aggregation explicit and allows practitioners to choose aggregation methods based on their normative properties rather than computational convenience.
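The choice of aggregation method is not cosmetic: on the same preference profile, majority voting on pairwise comparisons (standard RLHF) and Borda Count (one RLCHF option) can crown different winners. A small assumed example:

```python
# Five evaluators, three candidate responses.
profile = [("A", "B", "C")] * 3 + [("B", "C", "A")] * 2

def beats(x, y):
    """True if a strict majority of evaluators rank x above y."""
    return sum(r.index(x) < r.index(y) for r in profile) > len(profile) / 2

# Pairwise majority: A beats both B and C head-to-head (3 votes to 2).
majority_winner = next(
    c for c in "ABC" if all(beats(c, o) for o in "ABC" if o != c)
)

# Borda: B collects the most total points (B=7, A=6, C=2).
borda = {c: sum(len(r) - 1 - r.index(c) for r in profile) for c in "ABC"}
borda_winner = max(borda, key=borda.get)

print(majority_winner, borda_winner)  # A B
```

Because the winners differ, a practitioner must pick the method whose normative properties match their goals, which is exactly the decision RLCHF surfaces.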
Relationship to Existing Work
This mechanism directly addresses the failure mode described in the note 'RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values'. By aggregating at the ranking level with formal social choice functions, RLCHF preserves more information about preference diversity than collapsing everything into a single reward function.
The approach also connects to 'modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling'; both are attempts to handle preference heterogeneity more formally.
Relevant Notes:
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
- modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling
- post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives
Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map