---
type: claim
domain: ai-alignment
secondary_domains: [mechanisms]
description: "The features-based RLCHF variant learns individual preference models that incorporate evaluator characteristics, allowing aggregation across demographic or value-based groups"
confidence: experimental
source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)"
created: 2026-03-11
related:
- rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training
reweave_edges:
- rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training|related|2026-03-28
sourced_from:
- inbox/archive/ai-alignment/2024-04-00-conitzer-social-choice-guide-alignment.md
---
# RLCHF features-based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups
The second RLCHF variant proposed by Conitzer et al. (2024) takes a different approach: instead of aggregating rankings directly, it builds individual preference models that incorporate evaluator characteristics (demographics, values, context). These models can then be aggregated across groups, enabling context-sensitive preference aggregation.
This approach allows the system to learn: "People with characteristic X tend to prefer response type Y in context Z." Aggregation then happens by weighting or combining these learned preference functions according to a social choice rule, rather than aggregating raw rankings.
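The mechanism can be sketched in code. The model names, feature schema, and the utilitarian welfare rule below are illustrative assumptions, not details from Conitzer et al. (2024); the `preference_score` table stands in for a learned model of P(prefer response | evaluator features):

```python
from dataclasses import dataclass


@dataclass
class Evaluator:
    """An evaluator described by characteristics, e.g. {"region": "EU"}."""
    features: dict


def preference_score(response, evaluator, learned_weights):
    """Stand-in for a learned feature-conditioned preference model:
    sums weights keyed by (feature, value, response)."""
    return sum(
        learned_weights.get((name, value, response), 0.0)
        for name, value in evaluator.features.items()
    )


def aggregate(responses, evaluators, learned_weights, group_weight):
    """Aggregate learned preference functions under a social choice rule
    (here a simple weighted-utilitarian rule): pick the response that
    maximizes group-weighted total preference."""
    def welfare(response):
        return sum(
            group_weight(e) * preference_score(response, e, learned_weights)
            for e in evaluators
        )
    return max(responses, key=welfare)
```

Note that aggregation happens over the learned preference functions, not over raw rankings, so the same `learned_weights` can be re-aggregated under a different `group_weight` or social choice rule without retraining.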
The key advantage: this variant can handle preference heterogeneity more flexibly than the aggregated rankings variant. It can adapt aggregation based on context, represent minority preferences explicitly, and enable "what would group X prefer?" queries.
## Evidence
- Conitzer et al. (2024) describe this as the second RLCHF variant
- The paper notes this approach "incorporates evaluator characteristics" and enables "aggregation across diverse groups"
- This connects to the broader literature on personalized and pluralistic AI systems
## Comparison to Aggregated Rankings Variant
Where the aggregated rankings variant collapses preferences into a single collective ranking before training, the features-based variant preserves preference structure throughout. This allows:
- Context-dependent aggregation (different social choice rules for different situations)
- Explicit representation of minority preferences
- Transparency about which groups prefer which responses
The tradeoff: higher complexity and potential for misuse (e.g., demographic profiling, value discrimination).
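The "what would group X prefer?" query can be illustrated with a minimal sketch. The data and the plurality rule are invented for illustration; the point is only that preserving per-evaluator structure lets aggregation be restricted to one group after the fact:

```python
# Toy per-evaluator preference records (invented data): each evaluator
# has a group label and a ranking over responses r1, r2.
prefs = [
    {"group": "A", "ranking": ["r1", "r2"]},
    {"group": "A", "ranking": ["r1", "r2"]},
    {"group": "B", "ranking": ["r2", "r1"]},
]


def group_top_choice(prefs, group):
    """Answer "what would group X prefer?" via a simple plurality rule
    over that group's top-ranked responses."""
    counts = {}
    for p in prefs:
        if p["group"] == group:
            top = p["ranking"][0]
            counts[top] = counts.get(top, 0) + 1
    return max(counts, key=counts.get)
```

The aggregated-rankings variant cannot answer this query, since group structure is discarded before the reward model is trained.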
## Relationship to Existing Work
This approach is conceptually similar to [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]], but more explicit about incorporating evaluator features. Both recognize that preference heterogeneity is structural, not noise.
The features-based variant also connects to [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]—both emphasize that different communities have different legitimate preferences that should be represented rather than averaged away.
---
Relevant Notes:
- [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]]
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map
- foundations/collective-intelligence/_map