---
type: claim
domain: ai-alignment
secondary_domains: [mechanisms]
description: "The aggregated rankings variant of RLCHF applies formal social choice functions to combine multiple evaluator rankings before training the reward model"
confidence: experimental
source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)"
created: 2026-03-11
---

# RLCHF aggregated rankings variant combines evaluator rankings via social welfare function before reward model training

Conitzer et al. (2024) propose Reinforcement Learning from Collective Human Feedback (RLCHF) as a formalization of preference aggregation in AI alignment. The aggregated rankings variant works in three steps: (1) collect rankings of AI responses from multiple evaluators, (2) combine these rankings using a formal social welfare function (e.g., Borda Count, Ranked Pairs), (3) train the reward model on the aggregated ranking rather than on individual preferences.
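Step (2) can be sketched concretely with Borda Count, one of the social welfare functions the paper recommends. The data, function name, and tie-breaking rule below are illustrative assumptions, not the authors' implementation:

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Combine rankings (best-first lists of response IDs) into one
    aggregated ranking via Borda Count: the response in position i of a
    ranking over n items earns n - 1 - i points."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, response in enumerate(ranking):
            scores[response] += n - 1 - position
    # Sort by total score descending; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda r: (-scores[r], r))

# Three evaluators rank four candidate responses (best first):
evaluators = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["A", "C", "B", "D"],
]
aggregated = borda_aggregate(evaluators)
print(aggregated)  # ['A', 'B', 'C', 'D'] — this single ranking, not any
                   # one evaluator's, is what the reward model trains on
```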

This approach makes the social choice decision explicit and auditable. Instead of implicitly aggregating through dataset composition or reward model averaging, the aggregation happens at the ranking level using well-studied voting methods with known properties.

The key architectural choice: aggregation happens before reward model training, not during or after. This means the reward model learns from a collective preference signal rather than trying to learn individual preferences and aggregate them internally.

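Because aggregation precedes training, the reward model's training data can be derived directly from the single aggregated ranking. A minimal sketch, assuming a Bradley-Terry-style reward model trained on (chosen, rejected) pairs; the pairwise expansion is an illustrative assumption:

```python
from itertools import combinations

def ranking_to_pairs(aggregated_ranking):
    """Expand a best-first aggregated ranking into (chosen, rejected)
    pairs, the format a pairwise-preference reward model trains on."""
    return [(chosen, rejected) for chosen, rejected in combinations(aggregated_ranking, 2)]

pairs = ranking_to_pairs(["A", "B", "C"])
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```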
## Evidence

- Conitzer et al. (2024) describe two RLCHF variants; this is the first
- The paper recommends specific social welfare functions: Borda Count, Instant Runoff, Ranked Pairs
- This approach connects to 70+ years of social choice theory on voting methods

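Instant Runoff, another of the recommended social welfare functions, can be sketched as iterated elimination. Classic Instant Runoff selects only a winner; deriving a full ranking from the elimination order, as below, is an assumption for illustration:

```python
from collections import Counter

def instant_runoff(rankings):
    """Repeatedly eliminate the response with the fewest first-place
    votes among remaining responses; the reversed elimination order
    serves as a best-first aggregated ranking."""
    remaining = set(rankings[0])
    order = []
    while remaining:
        # Count first-place votes among still-remaining responses.
        firsts = Counter(next(c for c in r if c in remaining) for r in rankings)
        # Eliminate the response with the fewest (ties broken alphabetically).
        loser = min(remaining, key=lambda c: (firsts.get(c, 0), c))
        order.append(loser)
        remaining.remove(loser)
    return order[::-1]

ballots = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["B", "C", "A"],
]
result = instant_runoff(ballots)
print(result)  # ['B', 'A', 'C']
```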
## Comparison to Standard RLHF

Standard RLHF typically aggregates preferences implicitly through:

- Dataset composition (which evaluators are included)
- Majority voting on pairwise comparisons
- Averaging reward model predictions

RLCHF makes this aggregation explicit and allows practitioners to choose aggregation methods based on their normative properties rather than computational convenience.
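A constructed example of why the choice matters: majority voting on pairwise comparisons, one of the implicit aggregation routes above, can produce a cyclic preference signal, which gives the reward model no consistent chosen-vs-rejected labeling, whereas any of the recommended social welfare functions yields a total order:

```python
# Classic Condorcet cycle profile (constructed for illustration).
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a strict majority of evaluators rank x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")
# A beats B, B beats C, yet C beats A — the pairwise majority signal
# is cyclic, so no single ranking is consistent with it.
```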
## Relationship to Existing Work

This mechanism directly addresses the failure mode identified in [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]. By aggregating at the ranking level with formal social choice functions, RLCHF preserves more information about preference diversity than collapsing to a single reward function.

The approach also connects to [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]]; both are attempts to handle preference heterogeneity more formally.

---

Relevant Notes:

- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
- [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]]
- [[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]] <!-- claim pending -->

Topics:

- domains/ai-alignment/_map
- core/mechanisms/_map