teleo-codex/domains/ai-alignment/demographic-composition-of-alignment-training-data-produces-measurable-differences-in-model-behavior.md
Teleo Agents 4e0420b479 theseus: extract from 2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Source: inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-12 10:57:52 +00:00


- Type: claim
- Domain: ai-alignment
- Description: Empirical study with 1,095 participants shows 3-5 percentage point behavioral shifts based on whose feedback trains the model
- Confidence: likely
- Source: arXiv 2511.14476 (27,375 ratings from 1,095 participants)
- Created: 2026-03-11
- Enrichments:
  - community-centered norm elicitation surfaces alignment targets materially different from developer-specified rules
  - some disagreements are permanently irreducible because they stem from genuine value differences, not information gaps

Demographic composition of alignment training data produces measurable differences in model behavior

A systematic empirical study varying the demographic composition of human feedback in LLM alignment training demonstrates that "whose feedback" matters quantitatively, not just as a fairness concern. Models fine-tuned on feedback from Liberal, White, and Female participants showed improvements of 5.0, 4.7, and 3.4 percentage points respectively, relative to Conservative, Black, and Male baselines, measured across emotional awareness and toxicity dimensions.
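A percentage-point gap of this kind can be measured by scoring two models' responses on a shared evaluation set and comparing per-dimension pass rates. A minimal sketch, with hypothetical data and function names (the study's actual evaluation pipeline is not reproduced here):

```python
# Sketch: percentage-point gap between two fine-tuned models on binary
# per-dimension ratings. All data and names here are illustrative; this
# is not the study's actual evaluation pipeline.
from collections import defaultdict

def pp_gap(ratings_a, ratings_b):
    """ratings_*: iterable of (dimension, passed) pairs.
    Returns per-dimension gap in percentage points (A minus B)."""
    def pass_rates(ratings):
        totals, passes = defaultdict(int), defaultdict(int)
        for dim, ok in ratings:
            totals[dim] += 1
            passes[dim] += int(ok)
        return {d: passes[d] / totals[d] for d in totals}
    ra, rb = pass_rates(ratings_a), pass_rates(ratings_b)
    return {d: round(100 * (ra[d] - rb[d]), 1) for d in ra if d in rb}

# Hypothetical ratings: model A passes 55/100 toxicity checks, model B 50/100.
model_a = [("toxicity", True)] * 55 + [("toxicity", False)] * 45
model_b = [("toxicity", True)] * 50 + [("toxicity", False)] * 50
print(pp_gap(model_a, model_b))  # {'toxicity': 5.0}
```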

Evidence

The study collected 27,375 ratings from 1,095 participants, jointly varying demographic composition and technical design:

  • Liberal vs Conservative training data: 5.0 percentage point difference in model behavior
  • White vs Black training data: 4.7 percentage point difference
  • Female vs Male training data: 3.4 percentage point difference
  • Measured dimensions: emotional awareness and toxicity
  • Effect magnitude: 3-5 percentage points is substantial—this is not a subtle effect that disappears in noise

The study design systematically isolated demographic composition as a variable while controlling for technical design choices, establishing that the composition question in alignment is quantitatively important independent of implementation details.
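The claim that 3-5 percentage points sits well above noise can be sanity-checked with a back-of-envelope standard error. Assuming, hypothetically, that the 27,375 ratings split evenly across two comparison arms with pass rates near 50% (where the variance of a proportion peaks):

```python
import math

def se_diff(p1, n1, p2, n2):
    """Standard error of the difference between two independent proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Worst-case p ~ 0.5, with a hypothetical even split of the 27,375 ratings.
n = 27_375 // 2
se_pp = 100 * se_diff(0.5, n, 0.5, n)
print(f"SE of the gap: ~{se_pp:.2f} pp")
```

Under these assumptions the standard error is roughly 0.6 percentage points, so a 3-5 point gap is several standard errors from zero. The study's actual arm sizes and designs differ, so this is only an order-of-magnitude check, not a reanalysis.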

Implications

This empirical result transforms the pluralistic alignment debate from a philosophical question about fairness into a quantitative engineering constraint. An alignment approach that trains on a single demographic population can be expected to produce models whose behavior differs systematically, on the order of 3-5 percentage points on the measured dimensions, from models trained on other populations.

Single-population alignment training necessarily encodes the preferences of that population into model behavior, with measurable downstream effects on how the model responds to different users and contexts. The effect compounds with existing evidence that community-centered norm elicitation surfaces alignment targets materially different from developer-specified rules—not only do communities surface different norms, but training on those different norms produces measurably different model behavior.