Teleo Agents 3430cdd97a theseus: extract from 2025-11-00-operationalizing-pluralistic-values-llm-alignment.md

- Source: inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)

Pentagon-Agent: Theseus <HEADLESS>

2026-03-12 12:00:45 +00:00


type	domain	description	confidence	source	created
claim	ai-alignment	Demographic composition of human feedback providers materially affects aligned model behavior with effect sizes of 3-5 percentage points on safety dimensions	likely	arXiv 2511.14476	2026-03-11

Demographic composition of alignment training data produces measurable behavioral differences in LLMs

The demographic makeup of human feedback providers materially affects aligned model behavior, with effect sizes of 3-5 percentage points across key safety dimensions. This indicates that "whose feedback" matters for alignment outcomes, not only "how much feedback": the effect is quantitatively significant, not subtle.

Evidence

A systematic empirical study (arXiv 2511.14476) varying demographic composition of alignment training data across 27,375 ratings from 1,095 participants found:

- Models fine-tuned on Liberal feedback improved 5.0 percentage points on emotional awareness and toxicity metrics relative to a Conservative baseline
- Models fine-tuned on White feedback improved 4.7 percentage points relative to a Black baseline
- Models fine-tuned on Female feedback improved 3.4 percentage points relative to a Male baseline
- Effects were consistent across the emotional awareness and toxicity dimensions
- The sample of 1,095 participants is large for alignment research that uses real (not synthetic) human feedback
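
The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted in this note; the function and variable names are hypothetical, and the per-participant figure is an average that assumes nothing about how ratings were actually distributed across participants.

```python
# Figures as quoted in this note (summarizing arXiv 2511.14476).
TOTAL_RATINGS = 27_375
PARTICIPANTS = 1_095

# Reported gaps in percentage points, keyed by
# (feedback group used for fine-tuning, baseline group).
REPORTED_GAPS_PP = {
    ("Liberal", "Conservative"): 5.0,
    ("White", "Black"): 4.7,
    ("Female", "Male"): 3.4,
}


def ratings_per_participant(total_ratings, participants):
    """Average number of ratings per participant (hypothetical helper)."""
    return total_ratings / participants


def gap_range(gaps):
    """Smallest and largest reported gap, in percentage points."""
    values = sorted(gaps.values())
    return values[0], values[-1]


def largest_gap(gaps):
    """The (group pair, gap) entry with the largest reported gap."""
    return max(gaps.items(), key=lambda item: item[1])


print(ratings_per_participant(TOTAL_RATINGS, PARTICIPANTS))
print(gap_range(REPORTED_GAPS_PP))
print(largest_gap(REPORTED_GAPS_PP))
```

The reported gaps span 3.4 to 5.0 percentage points, which the note summarizes as "3-5 percentage points", and the rating total works out to an average of 25 ratings per participant.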

Significance

This provides empirical evidence that alignment training on feedback from a single population encodes the preferences of that specific population, not universal human values. The composition question is therefore quantitatively important for predicting model behavior, not merely a fairness concern. The effect sizes (3-5 percentage points) are large enough to be practically significant in deployed systems.

Connected claims:

Topics:

domains/ai-alignment/_map
