---
type: claim
domain: ai-alignment
description: "Demographic composition of human feedback providers materially affects aligned model behavior with effect sizes of 3-5 percentage points on safety dimensions"
confidence: likely
source: "arXiv 2511.14476"
created: 2026-03-11
---
# Demographic composition of alignment training data produces measurable behavioral differences in LLMs
The demographic makeup of human feedback providers materially affects aligned model behavior, with effect sizes of 3-5 percentage points across key safety dimensions. This demonstrates that "whose feedback" is as important as "how much feedback" for alignment outcomes: a quantitatively significant finding, not a subtle effect.
## Evidence
A systematic empirical study (arXiv 2511.14476) that varied the demographic composition of alignment training data, drawing on 27,375 ratings from 1,095 participants, found the following (a sketch of how such effect sizes can be computed appears after the list):
- Models fine-tuned on Liberal feedback improved 5.0 percentage points on emotional awareness and toxicity metrics relative to the Conservative baseline
- Models fine-tuned on White feedback improved 4.7 percentage points relative to the Black baseline
- Models fine-tuned on Female feedback improved 3.4 percentage points relative to the Male baseline
- Effects were consistent across the emotional awareness and toxicity dimensions
- N=1,095 participants is a large sample for alignment research with real human feedback (not synthetic)
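
A minimal sketch of how such a percentage-point effect size and its uncertainty could be computed, assuming per-prompt binary safety scores for two models fine-tuned on feedback from different demographic groups. The scores, function names, and bootstrap setup are illustrative assumptions, not the paper's actual evaluation code.

```python
# Illustrative sketch: percentage-point effect size between two
# subgroup-finetuned models, with a percentile-bootstrap interval.
# All data below is made up; the paper aggregates its real metrics
# over its own evaluation prompts.
import random

def pp_difference(scores_a, scores_b):
    """Difference in mean score between models, in percentage points."""
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return 100.0 * (mean_a - mean_b)

def bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the pp difference."""
    rng = random.Random(seed)
    diffs = sorted(
        pp_difference(
            rng.choices(scores_a, k=len(scores_a)),
            rng.choices(scores_b, k=len(scores_b)),
        )
        for _ in range(n_boot)
    )
    return diffs[int(alpha / 2 * n_boot)], diffs[int((1 - alpha / 2) * n_boot) - 1]

# Hypothetical per-prompt safety scores (1 = judged safe) for models
# fine-tuned on feedback from two demographic groups.
liberal_tuned = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
conservative_tuned = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]

print(pp_difference(liberal_tuned, conservative_tuned))  # 30.0 pp on this toy data
print(bootstrap_ci(liberal_tuned, conservative_tuned))   # interval around the estimate
```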
## Significance
This provides empirical evidence that single-population alignment training encodes the preferences of that specific population rather than universal human values. The composition question is therefore quantitatively important for predicting model behavior, not merely a fairness concern: the 3-5 percentage-point effect sizes are large enough to be practically significant in deployed systems.
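
As a concrete illustration of why composition matters operationally, here is a minimal sketch of auditing and partitioning a feedback pool by annotator demographics before training; the record schema and field names are hypothetical, not from the paper. If the audit shows a skewed pool, the 3-5 pp effect sizes above give a rough sense of how much that skew could shift downstream safety behavior.

```python
# Hypothetical schema: one record per rating, annotated with the
# rater's demographic attributes. Field names are illustrative.
from collections import Counter

ratings = [
    {"annotator_id": "a1", "politics": "liberal", "rating": 1},
    {"annotator_id": "a2", "politics": "conservative", "rating": 0},
    {"annotator_id": "a3", "politics": "liberal", "rating": 1},
]

def composition(records, field):
    """Share of ratings contributed by each demographic group."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def split_by_group(records, field):
    """Partition ratings by group, e.g. to train per-group models
    or to rebalance the pool before a single aligned run."""
    groups = {}
    for record in records:
        groups.setdefault(record[field], []).append(record)
    return groups

print(composition(ratings, "politics"))
# {'liberal': 0.666..., 'conservative': 0.333...}
```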
---
Connected claims:
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]
- [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps]]
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
Topics:
- [[domains/ai-alignment/_map]]