---
type: claim
domain: ai-alignment
secondary_domains: [mechanisms]
description: "The features-based RLCHF variant learns individual preference models that incorporate evaluator characteristics, allowing aggregation across demographic or value-based groups"
confidence: experimental
source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)"
created: 2026-03-11
---
# RLCHF features-based variant models individual preferences with evaluator characteristics, enabling aggregation across diverse groups
The second RLCHF (reinforcement learning from collective human feedback) variant proposed by Conitzer et al. (2024) takes a different approach: instead of aggregating rankings directly, it builds individual preference models that incorporate evaluator characteristics (demographics, values, context). These models can then be aggregated across groups, enabling context-sensitive preference aggregation.
This approach allows the system to learn: "People with characteristic X tend to prefer response type Y in context Z." Aggregation then happens by weighting or combining these learned preference functions according to a social choice rule, rather than aggregating raw rankings.
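A minimal sketch of what such a learned preference function might look like. The bilinear form, the `Evaluator` type, and all feature names are illustrative assumptions, not the paper's implementation; the point is only that the score depends on evaluator characteristics, not just the response.

```python
from dataclasses import dataclass


@dataclass
class Evaluator:
    # Evaluator characteristics as named features,
    # e.g. demographics, stated values, or context flags.
    features: dict[str, float]


def preference_score(evaluator: Evaluator,
                     response_features: dict[str, float],
                     weights: dict[tuple[str, str], float]) -> float:
    """Score one response for one evaluator via evaluator-response
    feature interactions (a deliberately simple bilinear form)."""
    return sum(
        weights.get((e_name, r_name), 0.0) * e_val * r_val
        for e_name, e_val in evaluator.features.items()
        for r_name, r_val in response_features.items()
    )
```

In this toy form, a learned weight on the pair `("values_privacy", "shares_data")` directly encodes "people with characteristic X tend to (dis)prefer response type Y."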
The key advantage: this variant can handle preference heterogeneity more flexibly than the aggregated rankings variant. It can adapt aggregation based on context, represent minority preferences explicitly, and enable "what would group X prefer?" queries.
## Evidence
- Conitzer et al. (2024) describe this as the second RLCHF variant
- The paper notes this approach "incorporates evaluator characteristics" and enables "aggregation across diverse groups"
- This connects to the broader literature on personalized and pluralistic AI systems
## Comparison to Aggregated Rankings Variant
Where the aggregated rankings variant collapses preferences into a single collective ranking before training, the features-based variant preserves preference structure throughout. This allows:
- Context-dependent aggregation (different social choice rules for different situations)
- Explicit representation of minority preferences
- Transparency about which groups prefer which responses
The tradeoff: higher complexity and potential for misuse (e.g., demographic profiling, value discrimination).
## Relationship to Existing Work
This approach is conceptually similar to [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]], but more explicit about incorporating evaluator features. Both recognize that preference heterogeneity is structural, not noise.
The features-based variant also connects to [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]—both emphasize that different communities have different legitimate preferences that should be represented rather than averaged away.
---
Relevant Notes:
- [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]]
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map
- foundations/collective-intelligence/_map