| type | domain | secondary_domains | description | confidence | source | created |
|---|---|---|---|---|---|---|
| claim | ai-alignment | | The features-based RLCHF variant learns individual preference models that incorporate evaluator characteristics, allowing aggregation across demographic or value-based groups | experimental | Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024) | 2026-03-11 |
RLCHF features-based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups
The second RLCHF variant proposed by Conitzer et al. (2024) takes a different approach: instead of aggregating rankings directly, it builds individual preference models that incorporate evaluator characteristics (demographics, values, context). These models can then be aggregated across groups, enabling context-sensitive preference aggregation.
This approach allows the system to learn: "People with characteristic X tend to prefer response type Y in context Z." Aggregation then happens by weighting or combining these learned preference functions according to a social choice rule, rather than aggregating raw rankings.
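The mechanism above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: the bilinear reward form, the feature vectors, and the two aggregation rules are all assumptions chosen to show how per-evaluator preference functions can be combined under different social choice rules.

```python
# Toy sketch: a reward model conditioned on evaluator features, then
# aggregated across groups by a social choice rule. All names and the
# bilinear reward form are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def featurized_reward(response_emb, evaluator_feats, W):
    # Bilinear reward: evaluators with different feature vectors can
    # score the same response differently.
    return float(evaluator_feats @ W @ response_emb)

# Toy dimensions: 3-dim response embeddings, 2-dim evaluator features.
W = rng.normal(size=(2, 3))        # interaction weights (learned in practice)
response = rng.normal(size=3)      # embedding of one candidate response

group_a = np.array([1.0, 0.0])     # feature vectors standing in for two groups
group_b = np.array([0.0, 1.0])

r_a = featurized_reward(response, group_a, W)
r_b = featurized_reward(response, group_b, W)

# Aggregation happens over the learned per-group rewards, not raw rankings:
# e.g. a utilitarian (mean) rule or an egalitarian (min) rule.
utilitarian = (r_a + r_b) / 2
egalitarian = min(r_a, r_b)
```

The point of the sketch is the separation of concerns: the preference model is learned once per evaluator profile, while the social choice rule is applied afterwards and can be swapped without retraining.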
The key advantage: this variant can handle preference heterogeneity more flexibly than the aggregated rankings variant. It can adapt aggregation based on context, represent minority preferences explicitly, and enable "what would group X prefer?" queries.
Evidence
- Conitzer et al. (2024) describe this as the second RLCHF variant
- The paper notes this approach "incorporates evaluator characteristics" and enables "aggregation across diverse groups"
- This connects to the broader literature on personalized and pluralistic AI systems
Comparison to Aggregated Rankings Variant
Where the aggregated rankings variant collapses preferences into a single collective ranking before training, the features-based variant preserves preference structure throughout. This allows:
- Context-dependent aggregation (different social choice rules for different situations)
- Explicit representation of minority preferences
- Transparency about which groups prefer which responses
The tradeoff: higher complexity and potential for misuse (e.g., demographic profiling, value discrimination).
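The three capabilities listed above can be made concrete with a small sketch. The group names, reward values, and rule names below are hypothetical, chosen so that the utilitarian and egalitarian rules disagree and the minority group's preference is visible.

```python
# Toy sketch of group-level queries and context-dependent aggregation.
# Rewards and group names are illustrative assumptions.

def group_pref(group, candidates, reward_fn):
    # "What would group X prefer?": rank candidates by that group's
    # learned reward and return the top one.
    return max(candidates, key=lambda c: reward_fn(c, group))

def aggregate(candidates, groups, reward_fn, rule):
    # Context-dependent aggregation: swap the social choice rule
    # without retraining the per-group preference models.
    if rule == "utilitarian":
        score = lambda c: sum(reward_fn(c, g) for g in groups)
    elif rule == "egalitarian":
        score = lambda c: min(reward_fn(c, g) for g in groups)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return max(candidates, key=score)

# Hypothetical learned rewards: candidate -> group -> reward.
rewards = {
    "resp1": {"majority": 0.9, "minority": 0.1},
    "resp2": {"majority": 0.5, "minority": 0.4},
}
reward_fn = lambda cand, group: rewards[cand][group]
groups = ["majority", "minority"]

# The utilitarian rule picks resp1 (total 1.0 vs 0.9); the egalitarian
# rule picks resp2 (worst-off reward 0.4 vs 0.1) -- the minority
# preference is represented explicitly rather than averaged away.
winner_util = aggregate(rewards, groups, reward_fn, "utilitarian")
winner_egal = aggregate(rewards, groups, reward_fn, "egalitarian")
minority_choice = group_pref("minority", rewards, reward_fn)
```

The same structure also exposes the misuse risk noted above: because per-group preference functions are explicit and queryable, they could equally be used for demographic profiling.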
Relationship to Existing Work
This approach is conceptually similar to the idea that modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling, though the features-based variant is more explicit about incorporating evaluator features. Both recognize that preference heterogeneity is structural, not noise.
The features-based variant also connects to the finding that community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules. Both emphasize that different communities have different legitimate preferences that should be represented rather than averaged away.
Relevant Notes:
- modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling
- community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map
- foundations/collective-intelligence/_map