---
type: claim
domain: ai-alignment
description: "Empirical study with 1,095 participants shows 3-5 percentage point behavioral shifts based on whose feedback trains the model"
confidence: likely
source: "arXiv 2511.14476, 27,375 ratings from 1,095 participants"
created: 2026-03-11
enrichments:
  - "community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules"
  - "some disagreements are permanently irreducible because they stem from genuine value differences not information gaps"
---
# Demographic composition of alignment training data produces measurable differences in model behavior
A systematic empirical study varying the demographic composition of human feedback in LLM alignment training demonstrates that "whose feedback" matters quantitatively, not just as a fairness concern. Models fine-tuned on feedback from Liberal, White, and Female participants showed improvements of 5.0, 4.7, and 3.4 percentage points respectively, relative to Conservative, Black, and Male baselines, measured across emotional awareness and toxicity dimensions.
## Evidence
The study collected 27,375 ratings from 1,095 participants, jointly varying demographic composition and technical design:
- **Liberal vs Conservative training data**: 5.0 percentage point difference in model behavior
- **White vs Black training data**: 4.7 percentage point difference
- **Female vs Male training data**: 3.4 percentage point difference
- **Measured dimensions**: emotional awareness and toxicity
- **Effect magnitude**: 3-5 percentage points is substantial—this is not a subtle effect that disappears in noise

The study design systematically isolated demographic composition as a variable while controlling for technical design choices, establishing that the composition question in alignment is quantitatively important independent of implementation details.
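
As a rough illustration of the measurement, the sketch below compares two hypothetical fine-tuned models on a binary behavioral metric and bootstraps a confidence interval on the percentage-point gap. This is not the paper's analysis code: the data are synthetic, and the pass rates and sample sizes are placeholders chosen to mimic a roughly 5-point gap.

```python
# Illustrative sketch: estimate the percentage-point gap between two models
# on a binary behavioral metric (e.g., "response judged non-toxic"), with a
# percentile-bootstrap confidence interval. All data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Placeholder per-response outcomes: 1 = response passes the metric.
# Rates chosen to mimic a ~5 pp gap like the Liberal-vs-Conservative result.
model_a = rng.binomial(1, 0.78, size=2000)  # tuned on population A feedback
model_b = rng.binomial(1, 0.73, size=2000)  # tuned on population B feedback

gap_pp = 100 * (model_a.mean() - model_b.mean())

# Percentile bootstrap over responses.
boot = []
for _ in range(5000):
    a = rng.choice(model_a, size=model_a.size, replace=True)
    b = rng.choice(model_b, size=model_b.size, replace=True)
    boot.append(100 * (a.mean() - b.mean()))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"gap = {gap_pp:.1f} pp, 95% CI [{lo:.1f}, {hi:.1f}]")
```

At these sample sizes the bootstrap interval is roughly ±3 percentage points wide, so a 5-point gap sits well clear of zero; that is what "does not disappear in noise" amounts to in practice.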
## Implications
This empirical result moves the pluralistic alignment debate from a philosophical question about fairness to a quantitative engineering constraint. An alignment approach that trains on feedback from a single demographic population can be expected to produce models whose behavior differs by roughly 3-5 percentage points from models trained on other populations, at least on the dimensions measured here.

Single-population alignment training necessarily encodes the preferences of that population into model behavior, with measurable downstream effects on how the model responds to different users and contexts. This compounds the existing evidence that community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules: not only do communities surface different norms, but training on those different norms produces measurably different model behavior.
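
A minimal sketch of the underlying mechanism, using assumed numbers: when two populations hold opposing pairwise preferences, a single preference model fit on the pooled feedback (the implicit setup in standard reward modeling) lands at the population-weighted average and matches neither group. The group labels, preference rates, and sample sizes below are hypothetical.

```python
# Illustrative sketch: a single preference model fit on pooled feedback
# encodes the training population's mix, not any one group's values.
# Two hypothetical groups disagree on whether "direct" beats "hedged".
import numpy as np

rng = np.random.default_rng(1)

p_group1, n1 = 0.80, 600  # group 1 prefers "direct" in 80% of comparisons
p_group2, n2 = 0.30, 400  # group 2 prefers "direct" in 30% of comparisons

wins1 = rng.binomial(n1, p_group1)  # comparisons won by "direct" in group 1
wins2 = rng.binomial(n2, p_group2)  # comparisons won by "direct" in group 2

# With only two response styles, the Bradley-Terry maximum-likelihood fit
# reduces to the pooled win fraction for "direct".
pooled = (wins1 + wins2) / (n1 + n2)

print(f"group 1 rate: {p_group1:.2f}  group 2 rate: {p_group2:.2f}")
print(f"pooled model rate: {pooled:.2f}  (matches neither group)")
```

Shifting the sampling weights between the two groups moves the pooled rate accordingly, which is the single-number analogue of the composition effects measured in the study.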
## Related Claims
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — provides qualitative evidence that different communities surface different norms; this claim quantifies the behavioral magnitude
- [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps]] — demographic composition effects may reflect irreducible value differences rather than information asymmetries
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — 3-5pp effects make single-population training inadequate for pluralistic alignment
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — demographic composition effects are one manifestation of this failure mode

---