---
type: claim
domain: ai-alignment
secondary_domains: [mechanisms]
description: "The features-based RLCHF variant learns individual preference models that incorporate evaluator characteristics, allowing aggregation across demographic or value-based groups"
confidence: experimental
source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)"
created: 2026-03-11
---

# RLCHF features-based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups

The second RLCHF variant proposed by Conitzer et al. (2024) takes a different approach: instead of aggregating rankings directly, it builds individual preference models that incorporate evaluator characteristics (demographics, values, context). These models can then be aggregated across groups, enabling context-sensitive preference aggregation.

This approach allows the system to learn patterns of the form "people with characteristic X tend to prefer response type Y in context Z." Aggregation then happens by weighting or combining these learned preference functions according to a social choice rule, rather than by aggregating raw rankings.

The key advantage: this variant can handle preference heterogeneity more flexibly than the aggregated rankings variant. It can adapt aggregation based on context, represent minority preferences explicitly, and enable "what would group X prefer?" queries.

## Evidence

- Conitzer et al. (2024) describe this as the second RLCHF variant
- The paper notes this approach "incorporates evaluator characteristics" and enables "aggregation across diverse groups"
- This connects to the broader literature on personalized and pluralistic AI systems

## Comparison to Aggregated Rankings Variant

Where the aggregated rankings variant collapses preferences into a single collective ranking before training, the features-based variant preserves preference structure throughout.
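Because per-group preference functions are preserved, aggregation can be deferred and varied by context. A minimal toy sketch of that aggregation stage, assuming a weighted-majority rule and made-up group labels, contexts, and probabilities (the paper gives no implementation; in the real variant the scores would come from a model conditioned on evaluator features):

```python
# Toy learned preference functions: for each (group, context) pair, the
# probability that response A is preferred to response B. All values are
# illustrative assumptions, not results from Conitzer et al. (2024).
group_pref = {
    ("expert", "safety"): 0.9,
    ("expert", "casual"): 0.4,
    ("layperson", "safety"): 0.6,
    ("layperson", "casual"): 0.2,
}

def aggregate(context, weights, rule="weighted_majority"):
    """Combine per-group preference functions under a social choice rule.

    `weights` maps group -> weight (e.g. population share); other rules
    could be plugged in here, which is what makes aggregation
    context-dependent.
    """
    if rule == "weighted_majority":
        score = sum(w * group_pref[(g, context)] for g, w in weights.items())
        return "A" if score / sum(weights.values()) > 0.5 else "B"
    raise ValueError(f"unknown rule: {rule}")

weights = {"expert": 0.5, "layperson": 0.5}
print(aggregate("safety", weights))        # A  (0.5*0.9 + 0.5*0.6 = 0.75)
print(aggregate("casual", weights))        # B  (0.5*0.4 + 0.5*0.2 = 0.30)

# "What would group X prefer?" query: restrict the weights to that group.
print(aggregate("casual", {"expert": 1.0}))  # B  (0.4 < 0.5)
```

The same groups yield different collective answers in different contexts, and a single-group query falls out for free; neither is expressible once rankings have been collapsed into one collective ranking.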
This allows:

- Context-dependent aggregation (different social choice rules for different situations)
- Explicit representation of minority preferences
- Transparency about which groups prefer which responses

The tradeoff: higher complexity and potential for misuse (e.g., demographic profiling, value discrimination).

## Relationship to Existing Work

This approach is conceptually similar to [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]], but more explicit about incorporating evaluator features. Both recognize that preference heterogeneity is structural, not noise.

The features-based variant also connects to [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]; both emphasize that different communities have different legitimate preferences that should be represented rather than averaged away.

---

Relevant Notes:

- [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]]
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]

Topics:

- domains/ai-alignment/_map
- core/mechanisms/_map
- foundations/collective-intelligence/_map