| type | domain | secondary_domains | description | confidence | source | created |
|---|---|---|---|---|---|---|
| claim | ai-alignment | | The features-based RLCHF variant learns individual preference models that incorporate evaluator characteristics, allowing aggregation across demographic or value-based groups | experimental | Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024) | 2026-03-11 |
RLCHF features-based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups
The second RLCHF variant proposed by Conitzer et al. (2024) takes a different approach: instead of aggregating rankings directly, it builds individual preference models that incorporate evaluator characteristics (demographics, values, context). These models can then be aggregated across groups, enabling context-sensitive preference aggregation.
This approach allows the system to learn: "People with characteristic X tend to prefer response type Y in context Z." Aggregation then happens by weighting or combining these learned preference functions according to a social choice rule, rather than aggregating raw rankings.
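The mechanism above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: the bilinear reward form, the feature vectors, and the two aggregation rules are all assumptions chosen to show how per-evaluator preference functions can be combined under different social choice rules.

```python
# Toy sketch: a reward model conditioned on evaluator features, then
# aggregated across groups by a social choice rule. All names and the
# bilinear reward form are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def featurized_reward(response_emb, evaluator_feats, W):
    # Bilinear reward: evaluators with different feature vectors can
    # score the same response differently.
    return float(evaluator_feats @ W @ response_emb)

# Toy dimensions: 3-dim response embeddings, 2-dim evaluator features.
W = rng.normal(size=(2, 3))        # interaction weights (learned in practice)
response = rng.normal(size=3)      # embedding of one candidate response

group_a = np.array([1.0, 0.0])     # feature vectors standing in for two groups
group_b = np.array([0.0, 1.0])

r_a = featurized_reward(response, group_a, W)
r_b = featurized_reward(response, group_b, W)

# Aggregation happens over the learned per-group rewards, not raw rankings:
# e.g. a utilitarian (mean) rule or an egalitarian (min) rule.
utilitarian = (r_a + r_b) / 2
egalitarian = min(r_a, r_b)
```

The point of the sketch is the separation of concerns: the preference model is learned once per evaluator profile, while the social choice rule is applied afterwards and can be swapped without retraining.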
The key advantage: this variant can handle preference heterogeneity more flexibly than the aggregated rankings variant. It can adapt aggregation based on context, represent minority preferences explicitly, and enable "what would group X prefer?" queries.
Evidence
- Conitzer et al. (2024) describe this as the second RLCHF variant
- The paper notes this approach "incorporates evaluator characteristics" and enables "aggregation across diverse groups"
- This connects to the broader literature on personalized and pluralistic AI systems
Comparison to Aggregated Rankings Variant
Where the aggregated rankings variant collapses preferences into a single collective ranking before training, the features-based variant preserves preference structure throughout. This allows:
- Context-dependent aggregation (different social choice rules for different situations)
- Explicit representation of minority preferences
- Transparency about which groups prefer which responses
The tradeoff: higher complexity and potential for misuse (e.g., demographic profiling, value discrimination).
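The three capabilities listed above can be made concrete with a small sketch. The group names, reward values, and rule names below are hypothetical, chosen so that the utilitarian and egalitarian rules disagree and the minority group's preference is visible.

```python
# Toy sketch of group-level queries and context-dependent aggregation.
# Rewards and group names are illustrative assumptions.

def group_pref(group, candidates, reward_fn):
    # "What would group X prefer?": rank candidates by that group's
    # learned reward and return the top one.
    return max(candidates, key=lambda c: reward_fn(c, group))

def aggregate(candidates, groups, reward_fn, rule):
    # Context-dependent aggregation: swap the social choice rule
    # without retraining the per-group preference models.
    if rule == "utilitarian":
        score = lambda c: sum(reward_fn(c, g) for g in groups)
    elif rule == "egalitarian":
        score = lambda c: min(reward_fn(c, g) for g in groups)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return max(candidates, key=score)

# Hypothetical learned rewards: candidate -> group -> reward.
rewards = {
    "resp1": {"majority": 0.9, "minority": 0.1},
    "resp2": {"majority": 0.5, "minority": 0.4},
}
reward_fn = lambda cand, group: rewards[cand][group]
groups = ["majority", "minority"]

# The utilitarian rule picks resp1 (total 1.0 vs 0.9); the egalitarian
# rule picks resp2 (worst-off reward 0.4 vs 0.1) -- the minority
# preference is represented explicitly rather than averaged away.
winner_util = aggregate(rewards, groups, reward_fn, "utilitarian")
winner_egal = aggregate(rewards, groups, reward_fn, "egalitarian")
minority_choice = group_pref("minority", rewards, reward_fn)
```

The same structure also exposes the misuse risk noted above: because per-group preference functions are explicit and queryable, they could equally be used for demographic profiling.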
Relationship to Existing Work
This approach is conceptually similar to the idea that modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling, though the features-based variant is more explicit about incorporating evaluator features. Both recognize that preference heterogeneity is structural, not noise.
The features-based variant also connects to the finding that community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules. Both emphasize that different communities have different legitimate preferences that should be represented rather than averaged away.
Relevant Notes:
- modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling
- community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map
- foundations/collective-intelligence/_map