
---
type: claim
domain: ai-alignment
secondary_domains:
  - mechanisms
description: The aggregated rankings variant of RLCHF applies formal social choice functions to combine multiple evaluator rankings before training the reward model
confidence: experimental
source: Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)
created: 2026-03-11
related:
  - rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups
reweave_edges:
  - rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups|related|2026-03-28
  - rlhf-is-implicit-social-choice-without-normative-scrutiny|supports|2026-03-28
supports:
  - rlhf-is-implicit-social-choice-without-normative-scrutiny
sourced_from: inbox/archive/ai-alignment/2024-04-00-conitzer-social-choice-guide-alignment.md
---

RLCHF aggregated rankings variant combines evaluator rankings via social welfare function before reward model training

Conitzer et al. (2024) propose Reinforcement Learning from Collective Human Feedback (RLCHF) as a formalization of preference aggregation in AI alignment. The aggregated rankings variant works by: (1) collecting rankings of AI responses from multiple evaluators, (2) combining these rankings using a formal social welfare function (e.g., Borda Count, Ranked Pairs), (3) training the reward model on the aggregated ranking rather than individual preferences.
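The three steps can be sketched as follows, here using Borda Count as the social welfare function; the function names and the Bradley-Terry-style pair extraction are illustrative assumptions, not details from the paper:

```python
from itertools import combinations

def borda_aggregate(rankings):
    """Step 2: combine evaluator rankings (best-first lists of response
    IDs) into one collective ranking via Borda Count."""
    n = len(rankings[0])
    scores = {}
    for ranking in rankings:
        for position, response in enumerate(ranking):
            # A response in position p of n earns n - 1 - p Borda points.
            scores[response] = scores.get(response, 0) + (n - 1 - position)
    # Best first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda r: (-scores[r], r))

def preference_pairs(collective_ranking):
    """Step 3: turn the aggregated ranking into (preferred, rejected)
    pairs, the training signal for a Bradley-Terry-style reward model."""
    return list(combinations(collective_ranking, 2))

# Step 1: three evaluators each rank four candidate responses, best first.
rankings = [
    ["r1", "r2", "r3", "r4"],
    ["r2", "r1", "r4", "r3"],
    ["r1", "r3", "r2", "r4"],
]
collective = borda_aggregate(rankings)  # ['r1', 'r2', 'r3', 'r4']
pairs = preference_pairs(collective)    # 6 pairs, e.g. ('r1', 'r2')
```

Note that the reward model only ever sees pairs derived from the single collective ranking, never the individual evaluators' disagreements.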

This approach makes the social choice decision explicit and auditable. Instead of implicitly aggregating through dataset composition or reward model averaging, the aggregation happens at the ranking level using well-studied voting methods with known properties.

The key architectural choice: aggregation happens before reward model training, not during or after. This means the reward model learns from a collective preference signal rather than trying to learn individual preferences and aggregate them internally.

Evidence

  • Conitzer et al. (2024) describe two RLCHF variants; this is the first
  • The paper recommends specific social welfare functions: Borda Count, Instant Runoff, Ranked Pairs
  • This approach connects to 70+ years of social choice theory on voting methods
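As one concrete instance of the recommended methods, a minimal Instant Runoff sketch over full rankings might look like this (extending IRV to a full collective ranking by reversing the elimination order is a common convention, assumed here rather than taken from the paper):

```python
from collections import Counter

def instant_runoff(rankings):
    """Instant Runoff over full best-first rankings: repeatedly eliminate
    the response with the fewest first-place votes; the collective
    ranking is the reverse of the elimination order."""
    remaining = set(rankings[0])
    eliminated = []
    while remaining:
        # First-place votes among responses still in the running.
        firsts = Counter(
            next(c for c in ranking if c in remaining) for ranking in rankings
        )
        # Fewest firsts is eliminated; ties broken alphabetically.
        loser = min(remaining, key=lambda c: (firsts[c], c))
        eliminated.append(loser)
        remaining.remove(loser)
    return eliminated[::-1]  # winner first

# Five evaluators rank three responses, best first.
rankings = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "c", "a"],
    ["c", "b", "a"],
    ["c", "a", "b"],
]
collective = instant_runoff(rankings)  # ['c', 'a', 'b']
```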

Comparison to Standard RLHF

Standard RLHF typically aggregates preferences implicitly through:

  • Dataset composition (which evaluators are included)
  • Majority voting on pairwise comparisons
  • Averaging reward model predictions

RLCHF makes this aggregation explicit and allows practitioners to choose aggregation methods based on their normative properties rather than computational convenience.
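To see why the explicit choice matters, here is a toy profile (illustrative, not from the paper) on which majority voting over pairwise comparisons and Borda Count pick different top responses:

```python
from collections import Counter

# Five evaluators rank three responses, best first.
rankings = [["a", "b", "c"]] * 3 + [["b", "c", "a"]] * 2

def borda_winner(rankings):
    """Top response under Borda Count aggregation."""
    n = len(rankings[0])
    scores = Counter()
    for ranking in rankings:
        for pos, r in enumerate(ranking):
            scores[r] += n - 1 - pos
    return max(scores, key=lambda r: (scores[r], r))

def condorcet_winner(rankings):
    """Response beating every rival in head-to-head majority votes --
    the signal that majority voting on pairwise comparisons encodes."""
    candidates = set(rankings[0])
    for x in candidates:
        if all(
            sum(r.index(x) < r.index(y) for r in rankings) > len(rankings) / 2
            for y in candidates - {x}
        ):
            return x
    return None  # no Condorcet winner exists for this profile

print(borda_winner(rankings))      # b
print(condorcet_winner(rankings))  # a
```

Pairwise-majority data would teach the reward model to prefer a, while the Borda-aggregated ranking puts b on top; which answer is right is a normative question, which is exactly the point of making the aggregation explicit.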

Relationship to Existing Work

This mechanism directly addresses the failure mode identified in the note *RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values*. By aggregating at the ranking level with formal social choice functions, RLCHF preserves more information about preference diversity than collapsing to a single reward function does.

The approach also connects to the note *modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling*; both are attempts to handle preference heterogeneity more formally.


Relevant Notes:

Topics:

  • domains/ai-alignment/_map
  • core/mechanisms/_map