teleo-codex/domains/ai-alignment/demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-llms.md
Teleo Agents e609e80e3e theseus: extract from 2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Source: inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 3)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-12 06:42:49 +00:00


type: claim
domain: ai-alignment
description: Models trained on feedback from different demographic groups show 3-5 percentage point performance differences on emotional awareness and toxicity metrics, demonstrating that whose preferences are represented in alignment training materially affects model behavior.
confidence: likely
source: arXiv 2511.14476
created: 2026-03-11

Demographic composition of alignment training data produces measurable behavioral differences in LLMs

The demographic makeup of human feedback providers materially affects aligned model behavior, with effect sizes of 3-5 percentage points across key safety dimensions. This is not a subtle effect: the differences are quantitatively significant and systematic.

A large-scale empirical study (27,375 ratings from 1,095 participants) jointly varied demographic composition and technical design in LLM alignment. Models fine-tuned on Liberal, White, and Female feedback showed improvements of 5.0, 4.7, and 3.4 percentage points respectively, relative to Conservative, Black, and Male baselines, measured across emotional awareness and toxicity dimensions.

This demonstrates that "whose feedback" matters as much as "how much feedback" for alignment outcomes. The composition of the training population materially affects model behavior in ways that cannot be dismissed as noise or marginal effects.

The study's scale (N=1,095) is large for alignment research, and, critically, the study used real human feedback rather than synthetic data. This makes the findings more robust than typical small-N alignment studies.

Evidence

  • 27,375 ratings from 1,095 participants across systematically varied demographic compositions
  • Liberal feedback vs Conservative baseline: +5.0 percentage points
  • White feedback vs Black baseline: +4.7 percentage points
  • Female feedback vs Male baseline: +3.4 percentage points
  • Effects measured on emotional awareness and toxicity dimensions
  • Real human feedback (not synthetic), making findings more robust than typical alignment studies
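As a rough illustration of how percentage-point gaps like those above are computed, the sketch below aggregates per-group evaluation scores and takes pairwise differences. All rating values, the 0-100 scale, and the group score lists are illustrative assumptions for demonstration, not data from arXiv 2511.14476.

```python
from statistics import mean

# Illustrative sketch only: the score values below are assumptions chosen to
# reproduce the reported gap sizes, not the study's actual data.

def pct_point_gap(treatment_scores, baseline_scores):
    """Gap in percentage points between two groups' mean metric scores (0-100 scale)."""
    return mean(treatment_scores) - mean(baseline_scores)

# Toy evaluation scores for models fine-tuned on each group's feedback.
scores = {
    "liberal":      [80.0, 76.0, 78.0],
    "conservative": [74.0, 72.0, 73.0],
    "white":        [77.0, 76.4, 76.7],
    "black":        [72.5, 71.5, 72.0],
    "female":       [76.0, 74.8, 75.4],
    "male":         [72.0, 72.0, 72.0],
}

gaps = {
    "liberal_vs_conservative": pct_point_gap(scores["liberal"], scores["conservative"]),
    "white_vs_black":          pct_point_gap(scores["white"], scores["black"]),
    "female_vs_male":          pct_point_gap(scores["female"], scores["male"]),
}

for pair, gap in gaps.items():
    print(f"{pair}: {gap:+.1f} pp")
```

Note that a "percentage point" difference is an absolute subtraction of two rates, not a relative percent change; a shift from 73% to 78% is +5.0 pp but roughly +6.8% relative.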

Implications

This finding challenges the implicit assumption in much alignment work that feedback from any sufficiently large population will converge to similar outcomes. It provides empirical grounding for the claim that community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules, and supports the view that some disagreements stem from genuine value differences rather than information gaps.

The magnitude of these effects (3-5 percentage points from demographic composition alone) suggests that single-population alignment training may be systematically biased in ways that technical improvements cannot address.


Related claims:

  • community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
  • some disagreements are permanently irreducible because they stem from genuine value differences not information gaps
  • pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
  • RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values