---
type: claim
domain: ai-alignment
secondary_domains: [mechanisms]
description: "The features-based RLCHF variant learns individual preference models that incorporate evaluator characteristics, allowing aggregation across demographic or value-based groups"
confidence: experimental
source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)"
created: 2026-03-11
related:
- rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training
reweave_edges:
- rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training|related|2026-03-28
sourced_from:
- inbox/archive/ai-alignment/2024-04-00-conitzer-social-choice-guide-alignment.md
---
# RLCHF features-based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups
The second RLCHF variant proposed by Conitzer et al. (2024) takes a different approach: instead of aggregating rankings directly, it builds individual preference models that incorporate evaluator characteristics (demographics, values, context). These models can then be aggregated across groups, enabling context-sensitive preference aggregation.
This approach allows the system to learn: "People with characteristic X tend to prefer response type Y in context Z." Aggregation then happens by weighting or combining these learned preference functions according to a social choice rule, rather than aggregating raw rankings.
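The mechanism can be sketched in code. The model names, feature schema, and the utilitarian welfare rule below are illustrative assumptions, not details from Conitzer et al. (2024); the `preference_score` table stands in for a learned model of P(prefer response | evaluator features):

```python
from dataclasses import dataclass


@dataclass
class Evaluator:
    """An evaluator described by characteristics, e.g. {"region": "EU"}."""
    features: dict


def preference_score(response, evaluator, learned_weights):
    """Stand-in for a learned feature-conditioned preference model:
    sums weights keyed by (feature, value, response)."""
    return sum(
        learned_weights.get((name, value, response), 0.0)
        for name, value in evaluator.features.items()
    )


def aggregate(responses, evaluators, learned_weights, group_weight):
    """Aggregate learned preference functions under a social choice rule
    (here a simple weighted-utilitarian rule): pick the response that
    maximizes group-weighted total preference."""
    def welfare(response):
        return sum(
            group_weight(e) * preference_score(response, e, learned_weights)
            for e in evaluators
        )
    return max(responses, key=welfare)
```

Note that aggregation happens over the learned preference functions, not over raw rankings, so the same `learned_weights` can be re-aggregated under a different `group_weight` or social choice rule without retraining.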
The key advantage: this variant can handle preference heterogeneity more flexibly than the aggregated rankings variant. It can adapt aggregation based on context, represent minority preferences explicitly, and enable "what would group X prefer?" queries.
## Evidence
- Conitzer et al. (2024) describe this as the second RLCHF variant
- The paper notes this approach "incorporates evaluator characteristics" and enables "aggregation across diverse groups"
- This connects to the broader literature on personalized and pluralistic AI systems
## Comparison to Aggregated Rankings Variant
Where the aggregated rankings variant collapses preferences into a single collective ranking before training, the features-based variant preserves preference structure throughout. This allows:
- Context-dependent aggregation (different social choice rules for different situations)
- Explicit representation of minority preferences
- Transparency about which groups prefer which responses
The tradeoff: higher complexity and potential for misuse (e.g., demographic profiling, value discrimination).
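The "what would group X prefer?" query can be illustrated with a minimal sketch. The data and the plurality rule are invented for illustration; the point is only that preserving per-evaluator structure lets aggregation be restricted to one group after the fact:

```python
# Toy per-evaluator preference records (invented data): each evaluator
# has a group label and a ranking over responses r1, r2.
prefs = [
    {"group": "A", "ranking": ["r1", "r2"]},
    {"group": "A", "ranking": ["r1", "r2"]},
    {"group": "B", "ranking": ["r2", "r1"]},
]


def group_top_choice(prefs, group):
    """Answer "what would group X prefer?" via a simple plurality rule
    over that group's top-ranked responses."""
    counts = {}
    for p in prefs:
        if p["group"] == group:
            top = p["ranking"][0]
            counts[top] = counts.get(top, 0) + 1
    return max(counts, key=counts.get)
```

The aggregated-rankings variant cannot answer this query, since group structure is discarded before the reward model is trained.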
## Relationship to Existing Work
This approach is conceptually similar to [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]], but more explicit about incorporating evaluator features. Both recognize that preference heterogeneity is structural, not noise.
The features-based variant also connects to [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]—both emphasize that different communities have different legitimate preferences that should be represented rather than averaged away.
---
Relevant Notes:
- [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]]
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map
- foundations/collective-intelligence/_map