---
type: claim
domain: ai-alignment
secondary_domains: [mechanisms]
description: "The features-based RLCHF variant learns individual preference models that incorporate evaluator characteristics, allowing aggregation across demographic or value-based groups"
confidence: experimental
source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)"
created: 2026-03-11
---
# RLCHF features-based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups
The second RLCHF (reinforcement learning from collective human feedback) variant proposed by Conitzer et al. (2024) takes a different approach: instead of aggregating rankings directly, it builds individual preference models that incorporate evaluator characteristics (demographics, values, context). These models can then be combined across groups, enabling context-sensitive preference aggregation.
This approach allows the system to learn: "People with characteristic X tend to prefer response type Y in context Z." Aggregation then happens by weighting or combining these learned preference functions according to a social choice rule, rather than aggregating raw rankings.
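As a minimal sketch of this idea (hypothetical, not from the paper): suppose the model predicts an evaluator's utility for a response as a bilinear function of evaluator features and response features. Aggregation then operates on the learned utility functions, here with a simple weighted-average (utilitarian) rule. All names and the bilinear form are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_eval_feats, n_resp_feats = 3, 4

# Bilinear utility u(e, r) = e^T W r. In a real system W would be
# learned from pairwise comparisons; here it is random for illustration.
W = rng.normal(size=(n_eval_feats, n_resp_feats))

def utility(evaluator, response):
    """Predicted utility of a response for an evaluator with given features."""
    return evaluator @ W @ response

def group_utility(evaluators, response, weights=None):
    """Aggregate predicted utilities across a group (utilitarian rule)."""
    scores = np.array([utility(e, response) for e in evaluators])
    return np.average(scores, weights=weights)

# Two candidate responses and one group of five evaluators.
responses = rng.normal(size=(2, n_resp_feats))
group = rng.normal(size=(5, n_eval_feats))

# "What would this group prefer?" query: the response with the higher
# aggregate utility under the chosen social choice rule.
preferred = max(range(len(responses)),
                key=lambda i: group_utility(group, responses[i]))
print(preferred)  # index of the group-preferred response
```

Because aggregation happens over learned utility functions rather than raw rankings, the `weights` argument (or a different rule entirely) can be swapped in per context without re-collecting feedback.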
The key advantage: this variant can handle preference heterogeneity more flexibly than the aggregated rankings variant. It can adapt aggregation based on context, represent minority preferences explicitly, and enable "what would group X prefer?" queries.
## Evidence
- Conitzer et al. (2024) describe this as the second RLCHF variant
- The paper notes this approach "incorporates evaluator characteristics" and enables "aggregation across diverse groups"
- This connects to the broader literature on personalized and pluralistic AI systems
## Comparison to Aggregated Rankings Variant
Where the aggregated rankings variant collapses preferences into a single collective ranking before training, the features-based variant preserves preference structure throughout. This allows:
- Context-dependent aggregation (different social choice rules for different situations)
- Explicit representation of minority preferences
- Transparency about which groups prefer which responses
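The capabilities above can be illustrated with a toy example (hypothetical numbers, not from the paper): the same table of per-evaluator utilities, produced by a features-based model, can be aggregated under different social choice rules, and a minority subgroup can be queried explicitly.

```python
import numpy as np

# Predicted utilities for 2 candidate responses from 5 evaluators
# (rows = responses, columns = evaluators); illustrative values.
utilities = np.array([
    [0.9, 0.9, 0.8, 0.1, 0.2],  # response A: majority likes, minority dislikes
    [0.5, 0.5, 0.5, 0.6, 0.6],  # response B: moderate for everyone
])
minority = np.array([False, False, False, True, True])

def utilitarian(u):
    """Score responses by average utility across evaluators."""
    return u.mean(axis=1)

def egalitarian(u):
    """Score responses by the worst-off evaluator's utility (Rawlsian)."""
    return u.min(axis=1)

# Different rules can pick different winners over the same utilities:
print(int(np.argmax(utilitarian(utilities))))   # 0: A wins on average
print(int(np.argmax(egalitarian(utilities))))   # 1: B wins on worst-case

# "What would group X prefer?" query on the minority subgroup:
print(int(np.argmax(utilities[:, minority].mean(axis=1))))  # 1
```

The divergence between the two rules is exactly the transparency property the note describes: because per-group preferences remain explicit, the system can report that response A wins under the utilitarian rule while the minority subgroup prefers B.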
The tradeoff: higher complexity and potential for misuse (e.g., demographic profiling, value discrimination).
## Relationship to Existing Work
This approach is conceptually similar to [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]], but more explicit about incorporating evaluator features. Both recognize that preference heterogeneity is structural, not noise.
The features-based variant also connects to [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]—both emphasize that different communities have different legitimate preferences that should be represented rather than averaged away.
---
Relevant Notes:
- [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]]
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map
- foundations/collective-intelligence/_map