
---
type: claim
domain: ai-alignment
secondary_domains:
  - mechanisms
description: The aggregated rankings variant of RLCHF applies formal social choice functions to combine multiple evaluator rankings before training the reward model
confidence: experimental
source: Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)
created: 2026-03-11
related:
  - rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups
reweave_edges:
  - rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups|related|2026-03-28
  - rlhf-is-implicit-social-choice-without-normative-scrutiny|supports|2026-03-28
supports:
  - rlhf-is-implicit-social-choice-without-normative-scrutiny
sourced_from: inbox/archive/ai-alignment/2024-04-00-conitzer-social-choice-guide-alignment.md
---

RLCHF aggregated rankings variant combines evaluator rankings via social welfare function before reward model training

Conitzer et al. (2024) propose Reinforcement Learning from Collective Human Feedback (RLCHF) as a formalization of preference aggregation in AI alignment. The aggregated rankings variant works by: (1) collecting rankings of AI responses from multiple evaluators, (2) combining these rankings using a formal social welfare function (e.g., Borda Count, Ranked Pairs), (3) training the reward model on the aggregated ranking rather than individual preferences.
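The three steps can be sketched as follows, here using Borda Count as the social welfare function; the function names and the Bradley-Terry-style pair extraction are illustrative assumptions, not details from the paper:

```python
from itertools import combinations

def borda_aggregate(rankings):
    """Step 2: combine evaluator rankings (best-first lists of response
    IDs) into one collective ranking via Borda Count."""
    n = len(rankings[0])
    scores = {}
    for ranking in rankings:
        for position, response in enumerate(ranking):
            # A response in position p of n earns n - 1 - p Borda points.
            scores[response] = scores.get(response, 0) + (n - 1 - position)
    # Best first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda r: (-scores[r], r))

def preference_pairs(collective_ranking):
    """Step 3: turn the aggregated ranking into (preferred, rejected)
    pairs, the training signal for a Bradley-Terry-style reward model."""
    return list(combinations(collective_ranking, 2))

# Step 1: three evaluators each rank four candidate responses, best first.
rankings = [
    ["r1", "r2", "r3", "r4"],
    ["r2", "r1", "r4", "r3"],
    ["r1", "r3", "r2", "r4"],
]
collective = borda_aggregate(rankings)  # ['r1', 'r2', 'r3', 'r4']
pairs = preference_pairs(collective)    # 6 pairs, e.g. ('r1', 'r2')
```

Note that the reward model only ever sees pairs derived from the single collective ranking, never the individual evaluators' disagreements.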

This approach makes the social choice decision explicit and auditable. Instead of implicitly aggregating through dataset composition or reward model averaging, the aggregation happens at the ranking level using well-studied voting methods with known properties.

The key architectural choice: aggregation happens before reward model training, not during or after. This means the reward model learns from a collective preference signal rather than trying to learn individual preferences and aggregate them internally.

Evidence

  • Conitzer et al. (2024) describe two RLCHF variants; this is the first
  • The paper recommends specific social welfare functions: Borda Count, Instant Runoff, Ranked Pairs
  • This approach connects to 70+ years of social choice theory on voting methods
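As one concrete instance of the recommended methods, a minimal Instant Runoff sketch over full rankings might look like this (extending IRV to a full collective ranking by reversing the elimination order is a common convention, assumed here rather than taken from the paper):

```python
from collections import Counter

def instant_runoff(rankings):
    """Instant Runoff over full best-first rankings: repeatedly eliminate
    the response with the fewest first-place votes; the collective
    ranking is the reverse of the elimination order."""
    remaining = set(rankings[0])
    eliminated = []
    while remaining:
        # First-place votes among responses still in the running.
        firsts = Counter(
            next(c for c in ranking if c in remaining) for ranking in rankings
        )
        # Fewest firsts is eliminated; ties broken alphabetically.
        loser = min(remaining, key=lambda c: (firsts[c], c))
        eliminated.append(loser)
        remaining.remove(loser)
    return eliminated[::-1]  # winner first

# Five evaluators rank three responses, best first.
rankings = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "c", "a"],
    ["c", "b", "a"],
    ["c", "a", "b"],
]
collective = instant_runoff(rankings)  # ['c', 'a', 'b']
```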

Comparison to Standard RLHF

Standard RLHF typically aggregates preferences implicitly through:

  • Dataset composition (which evaluators are included)
  • Majority voting on pairwise comparisons
  • Averaging reward model predictions

RLCHF makes this aggregation explicit and allows practitioners to choose aggregation methods based on their normative properties rather than computational convenience.
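To see why the explicit choice matters, here is a toy profile (illustrative, not from the paper) on which majority voting over pairwise comparisons and Borda Count pick different top responses:

```python
from collections import Counter

# Five evaluators rank three responses, best first.
rankings = [["a", "b", "c"]] * 3 + [["b", "c", "a"]] * 2

def borda_winner(rankings):
    """Top response under Borda Count aggregation."""
    n = len(rankings[0])
    scores = Counter()
    for ranking in rankings:
        for pos, r in enumerate(ranking):
            scores[r] += n - 1 - pos
    return max(scores, key=lambda r: (scores[r], r))

def condorcet_winner(rankings):
    """Response beating every rival in head-to-head majority votes --
    the signal that majority voting on pairwise comparisons encodes."""
    candidates = set(rankings[0])
    for x in candidates:
        if all(
            sum(r.index(x) < r.index(y) for r in rankings) > len(rankings) / 2
            for y in candidates - {x}
        ):
            return x
    return None  # no Condorcet winner exists for this profile

print(borda_winner(rankings))      # b
print(condorcet_winner(rankings))  # a
```

Pairwise-majority data would teach the reward model to prefer a, while the Borda-aggregated ranking puts b on top; which answer is right is a normative question, which is exactly the point of making the aggregation explicit.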

Relationship to Existing Work

This mechanism directly addresses the failure mode identified in the note *RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values*. By aggregating at the ranking level with formal social choice functions, RLCHF preserves more information about preference diversity than collapsing to a single reward function does.

The approach also connects to the note *modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling*; both are attempts to handle preference heterogeneity more formally.


Relevant Notes:

Topics:

  • domains/ai-alignment/_map
  • core/mechanisms/_map