teleo-codex/domains/ai-alignment/demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-llms.md
Teleo Agents e609e80e3e theseus: extract from 2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Source: inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 3)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-12 06:42:49 +00:00


type: claim
domain: ai-alignment
description: Models trained on feedback from different demographic groups show 3-5 percentage point performance differences on emotional awareness and toxicity metrics, demonstrating that whose preferences are represented in alignment training materially affects model behavior.
confidence: likely
source: arXiv 2511.14476
created: 2026-03-11

Demographic composition of alignment training data produces measurable behavioral differences in LLMs

The demographic makeup of human feedback providers materially affects aligned model behavior, with effect sizes of 3-5 percentage points across key safety dimensions. This is not a subtle effect: the differences are quantitatively significant and systematic.

A large-scale empirical study (27,375 ratings from 1,095 participants) jointly varied demographic composition and technical design in LLM alignment. Models fine-tuned on Liberal, White, and Female feedback showed improvements of 5.0, 4.7, and 3.4 percentage points respectively, relative to Conservative, Black, and Male baselines, measured across emotional awareness and toxicity dimensions.

This demonstrates that "whose feedback" matters as much as "how much feedback" for alignment outcomes. The composition of the training population materially affects model behavior in ways that cannot be dismissed as noise or marginal effects.

The study's scale (N=1,095) is large for alignment research, and, critically, the study used real human feedback rather than synthetic data. This makes the findings more robust than typical small-N alignment studies.

Evidence

  • 27,375 ratings from 1,095 participants across systematically varied demographic compositions
  • Liberal feedback vs Conservative baseline: +5.0 percentage points
  • White feedback vs Black baseline: +4.7 percentage points
  • Female feedback vs Male baseline: +3.4 percentage points
  • Effects measured on emotional awareness and toxicity dimensions
  • Real human feedback (not synthetic), making findings more robust than typical alignment studies
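As a rough illustration of how percentage-point gaps like those above are computed, the sketch below aggregates per-group evaluation scores and takes pairwise differences. All rating values, the 0-100 scale, and the group score lists are illustrative assumptions for demonstration, not data from arXiv 2511.14476.

```python
from statistics import mean

# Illustrative sketch only: the score values below are assumptions chosen to
# reproduce the reported gap sizes, not the study's actual data.

def pct_point_gap(treatment_scores, baseline_scores):
    """Gap in percentage points between two groups' mean metric scores (0-100 scale)."""
    return mean(treatment_scores) - mean(baseline_scores)

# Toy evaluation scores for models fine-tuned on each group's feedback.
scores = {
    "liberal":      [80.0, 76.0, 78.0],
    "conservative": [74.0, 72.0, 73.0],
    "white":        [77.0, 76.4, 76.7],
    "black":        [72.5, 71.5, 72.0],
    "female":       [76.0, 74.8, 75.4],
    "male":         [72.0, 72.0, 72.0],
}

gaps = {
    "liberal_vs_conservative": pct_point_gap(scores["liberal"], scores["conservative"]),
    "white_vs_black":          pct_point_gap(scores["white"], scores["black"]),
    "female_vs_male":          pct_point_gap(scores["female"], scores["male"]),
}

for pair, gap in gaps.items():
    print(f"{pair}: {gap:+.1f} pp")
```

Note that a "percentage point" difference is an absolute subtraction of two rates, not a relative percent change; a shift from 73% to 78% is +5.0 pp but roughly +6.8% relative.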

Implications

This finding challenges the implicit assumption in much alignment work that feedback from any sufficiently large population will converge to similar outcomes. It provides empirical grounding for the claim that community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules, and supports the view that some disagreements stem from genuine value differences rather than information gaps.

The magnitude of these effects (3-5 percentage points from demographic composition alone) suggests that single-population alignment training may be systematically biased in ways that technical improvements cannot address.


Related claims:

  • community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
  • some disagreements are permanently irreducible because they stem from genuine value differences not information gaps
  • pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
  • RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values