teleo-codex/domains/ai-alignment/rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups.md
Teleo Pipeline db5bbf3eb7 reweave: connect 48 orphan claims via vector similarity
Threshold: 0.7, Haiku classification, 80 files modified.

Pentagon-Agent: Epimetheus <0144398e-4ed3-4fe2-95a3-3d72e1abf887>
2026-03-28 23:04:53 +00:00


type: claim
domain: ai-alignment
secondary_domains: mechanisms
description: The features-based RLCHF variant learns individual preference models that incorporate evaluator characteristics, allowing aggregation across demographic or value-based groups
confidence: experimental
source: Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)
created: 2026-03-11
related: rlchf aggregated rankings variant combines evaluator rankings via social welfare function before reward model training
reweave_edges: rlchf aggregated rankings variant combines evaluator rankings via social welfare function before reward model training|related|2026-03-28

RLCHF features-based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups

The second RLCHF variant proposed by Conitzer et al. (2024) takes a different approach: instead of aggregating rankings directly, it builds individual preference models that incorporate evaluator characteristics (demographics, values, context). These individual models can then be combined across groups, enabling context-sensitive preference aggregation.

This approach allows the system to learn: "People with characteristic X tend to prefer response type Y in context Z." Aggregation then happens by weighting or combining these learned preference functions according to a social choice rule, rather than aggregating raw rankings.
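One way to make this concrete is a Bradley-Terry-style pairwise preference model whose logit depends on both evaluator characteristics and response features. The bilinear form, the feature dimensions, and the synthetic data below are illustrative assumptions, not the paper's specification; this is a minimal sketch of the "people with characteristic X prefer response type Y" idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def preference_logit(w, evaluator_feats, response_feats):
    """Score a response for a specific evaluator via a bilinear interaction
    between evaluator characteristics and response features (an assumed form)."""
    return evaluator_feats @ w @ response_feats

def bt_grad(w, evaluator_feats, feats_pref, feats_rej):
    """Gradient of -log P(preferred beats rejected) under a Bradley-Terry
    model whose logit is the bilinear score above."""
    diff = feats_pref - feats_rej
    p = 1.0 / (1.0 + np.exp(-(evaluator_feats @ w @ diff)))
    return -(1.0 - p) * np.outer(evaluator_feats, diff)

# Toy data: an evaluator with characteristics x prefers responses whose
# features align with x, so the true interaction matrix is the identity.
d_eval, d_resp = 3, 3
w = np.zeros((d_eval, d_resp))
for _ in range(2000):
    x = rng.standard_normal(d_eval)
    a, b = rng.standard_normal(d_resp), rng.standard_normal(d_resp)
    if x @ a < x @ b:                # label: better-aligned response preferred
        a, b = b, a
    w -= 0.05 * bt_grad(w, x, a, b)  # SGD step on the pairwise loss
```

After training, `w` recovers the identity-like alignment structure, so the model can predict preferences for unseen evaluator/response pairs; in the RLCHF setting the aggregation step would then operate on these learned functions rather than on raw rankings.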

The key advantage: this variant can handle preference heterogeneity more flexibly than the aggregated rankings variant. It can adapt aggregation based on context, represent minority preferences explicitly, and enable "what would group X prefer?" queries.

Evidence

  • Conitzer et al. (2024) describe this as the second RLCHF variant
  • The paper notes this approach "incorporates evaluator characteristics" and enables "aggregation across diverse groups"
  • This connects to the broader literature on personalized and pluralistic AI systems

Comparison to Aggregated Rankings Variant

Where the aggregated rankings variant collapses preferences into a single collective ranking before training, the features-based variant preserves preference structure throughout. This allows:

  • Context-dependent aggregation (different social choice rules for different situations)
  • Explicit representation of minority preferences
  • Transparency about which groups prefer which responses

The tradeoff: higher complexity and potential for misuse (e.g., demographic profiling, value discrimination).
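The contrast above can be sketched in code: given per-group preference functions (e.g. distilled from a features-based model), different social choice rules can be applied at query time, and "what would group X prefer?" becomes a direct lookup. The group names, scores, and the two rules below are illustrative assumptions, not the paper's construction:

```python
from typing import Callable, Dict, List

# Hypothetical per-group preference functions; scores are made up
# for illustration.
group_scores: Dict[str, Callable[[str], float]] = {
    "group_a": lambda r: {"concise": 1.0, "detailed": 0.5}[r],
    "group_b": lambda r: {"concise": 0.0, "detailed": 0.45}[r],
}

def aggregate(responses: List[str], weights: Dict[str, float],
              rule: str = "utilitarian") -> str:
    """Pick a response by combining group preferences under a social choice rule."""
    def collective_score(r: str) -> float:
        if rule == "utilitarian":   # weighted sum of group utilities
            return sum(w * group_scores[g](r) for g, w in weights.items())
        if rule == "egalitarian":   # maximin: protect the worst-off group
            return min(group_scores[g](r) for g in weights)
        raise ValueError(f"unknown rule: {rule}")
    return max(responses, key=collective_score)

def group_query(group: str, responses: List[str]) -> str:
    """Answer 'what would group X prefer?' directly from its learned model."""
    return max(responses, key=group_scores[group])
```

With equal weights, the utilitarian rule picks "concise" (0.5 vs. 0.475) while the egalitarian rule picks "detailed" (0.45 vs. 0.0): the preserved per-group structure is what lets the rule change the outcome, which the aggregated rankings variant cannot do after collapsing to one ranking.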

Relationship to Existing Work

This approach is conceptually similar to the claim that modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling, though the features-based variant is more explicit about incorporating evaluator features. Both recognize that preference heterogeneity is structural, not noise.

The features-based variant also connects to the finding that community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules: both emphasize that different communities have different legitimate preferences that should be represented rather than averaged away.


Relevant Notes:

Topics:

  • domains/ai-alignment/_map
  • core/mechanisms/_map
  • foundations/collective-intelligence/_map