teleo-codex/domains/ai-alignment/demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-llms.md
Teleo Agents e39c76c3c2 theseus: extract claims from 2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Source: inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 5)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-11 09:18:59 +00:00


type: claim
domain: ai-alignment
description: Demographic composition of human feedback providers materially affects aligned model behavior, with effect sizes of 3-5 percentage points across emotional awareness and toxicity metrics, a magnitude comparable to technical alignment improvements.
confidence: likely
source: arXiv 2511.14476 - Operationalizing Pluralistic Values in Large Language Model Alignment (2025)
created: 2025-11-01
depends_on:
  • community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
  • some disagreements are permanently irreducible because they stem from genuine value differences not information gaps

Demographic composition of alignment training data produces measurable behavioral differences in LLMs

The demographic makeup of human feedback providers materially affects aligned model behavior. The reported effect is not subtle: group-specific fine-tuning shifts measured behavior by 3.4 to 5.0 percentage points, which suggests that "whose feedback" can matter as much as "how much feedback" for alignment outcomes.

Evidence

A systematic empirical study (arXiv 2511.14476) collected 27,375 ratings from 1,095 participants, jointly varying demographic composition and technical design:

  • Models fine-tuned on Liberal feedback improved 5.0 percentage points relative to a Conservative baseline
  • Models fine-tuned on White feedback improved 4.7 percentage points relative to a Black baseline
  • Models fine-tuned on Female feedback improved 3.4 percentage points relative to a Male baseline
  • Effects were measured across emotional awareness and toxicity dimensions
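
As a quick consistency check on the figures quoted above, the sketch below recomputes the average number of ratings per participant and the span of the reported effects. The numbers are copied from this note. The variable and function names are illustrative assumptions, and the note does not say whether every participant contributed the same number of ratings, so the first result is an average only.

```python
# Figures as quoted in this note for arXiv 2511.14476.
TOTAL_RATINGS = 27_375
PARTICIPANTS = 1_095

# (fine-tuning group, baseline group) -> reported improvement, in percentage points.
REPORTED_EFFECTS_PP = {
    ("Liberal", "Conservative"): 5.0,
    ("White", "Black"): 4.7,
    ("Female", "Male"): 3.4,
}


def ratings_per_participant(total_ratings: int, participants: int) -> float:
    """Average ratings per participant (an average, not a per-person guarantee)."""
    return total_ratings / participants


def effect_range(effects: dict) -> tuple:
    """Smallest and largest reported effect, in percentage points."""
    values = list(effects.values())
    return min(values), max(values)


if __name__ == "__main__":
    print("avg ratings per participant:", ratings_per_participant(TOTAL_RATINGS, PARTICIPANTS))
    low, high = effect_range(REPORTED_EFFECTS_PP)
    print(f"reported effects span {low} to {high} percentage points")
```

The "3-5 percentage points" shorthand used elsewhere in this note corresponds to the 3.4 to 5.0 span computed here.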

The study collected real human feedback from 1,095 participants rather than synthetic ratings, which makes it, as far as this note can tell, one of the larger empirical investigations of demographic composition effects in alignment training. Critically, identical technical methods applied to different demographic groups produced systematically different model behaviors. This indicates that the effect is not a methodological artifact and is better explained by genuine value differences in the training populations.

Implications

This finding indicates that single-population alignment training encodes specific demographic perspectives into model behavior rather than universal human values. The magnitude of the effect (3-5 percentage points) is comparable to many technical alignment improvements, so demographic composition is a first-order variable in alignment outcomes, not a secondary fairness consideration.
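
Because the comparison with technical improvements depends on what a "percentage point" is, a minimal sketch of the distinction may help. A percentage-point gap is an absolute difference between two rates, while a relative change also depends on the baseline. The 0.60 and 0.65 rates below are hypothetical and chosen only for illustration. The note does not report the underlying baseline rates, and the function names are assumptions.

```python
def percentage_point_gap(group_rate: float, baseline_rate: float) -> float:
    """Absolute difference between two rates, expressed in percentage points."""
    return (group_rate - baseline_rate) * 100


def relative_change_pct(group_rate: float, baseline_rate: float) -> float:
    """Difference relative to the baseline rate, expressed in percent."""
    return (group_rate - baseline_rate) / baseline_rate * 100


if __name__ == "__main__":
    # Hypothetical rates, NOT taken from the study.
    baseline, group = 0.60, 0.65
    print(f"gap: {percentage_point_gap(group, baseline):.1f} percentage points")
    print(f"relative change: {relative_change_pct(group, baseline):.1f}%")
```

The same 5-point gap would be a larger relative change against a lower baseline, which is why the note compares effects in percentage points and not in relative terms.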

The result directly supports the claim that community-centred norm elicitation surfaces materially different alignment targets: different populations surface different targets even when technical methods are held constant. It is also consistent with the claim that some disagreements are permanently irreducible because they stem from genuine value differences. The differences persist across identical elicitation processes, which suggests the disagreement lies in the values themselves, not in the process.


Relevant Notes: