teleo-codex/domains/ai-alignment/demographic-composition-of-alignment-training-data-produces-measurable-differences-in-model-behavior.md
Teleo Agents 4e0420b479 theseus: extract from 2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Source: inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-12 10:57:52 +00:00


- Type: claim
- Domain: ai-alignment
- Description: Empirical study with 1,095 participants shows 3-5 percentage point behavioral shifts based on whose feedback trains the model
- Confidence: likely
- Source: arXiv 2511.14476 (27,375 ratings from 1,095 participants)
- Created: 2026-03-11
- Enrichments:
  - community-centered norm elicitation surfaces alignment targets materially different from developer-specified rules
  - some disagreements are permanently irreducible because they stem from genuine value differences, not information gaps

Demographic composition of alignment training data produces measurable differences in model behavior

A systematic empirical study varying the demographic composition of human feedback in LLM alignment training demonstrates that "whose feedback" matters quantitatively, not just as a fairness concern. Models fine-tuned on feedback from Liberal, White, and Female participants showed improvements of 5.0, 4.7, and 3.4 percentage points respectively, relative to Conservative, Black, and Male baselines, measured across emotional awareness and toxicity dimensions.
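A percentage-point gap of this kind can be measured by scoring two models' responses on a shared evaluation set and comparing per-dimension pass rates. A minimal sketch, with hypothetical data and function names (the study's actual evaluation pipeline is not reproduced here):

```python
# Sketch: percentage-point gap between two fine-tuned models on binary
# per-dimension ratings. All data and names here are illustrative; this
# is not the study's actual evaluation pipeline.
from collections import defaultdict

def pp_gap(ratings_a, ratings_b):
    """ratings_*: iterable of (dimension, passed) pairs.
    Returns per-dimension gap in percentage points (A minus B)."""
    def pass_rates(ratings):
        totals, passes = defaultdict(int), defaultdict(int)
        for dim, ok in ratings:
            totals[dim] += 1
            passes[dim] += int(ok)
        return {d: passes[d] / totals[d] for d in totals}
    ra, rb = pass_rates(ratings_a), pass_rates(ratings_b)
    return {d: round(100 * (ra[d] - rb[d]), 1) for d in ra if d in rb}

# Hypothetical ratings: model A passes 55/100 toxicity checks, model B 50/100.
model_a = [("toxicity", True)] * 55 + [("toxicity", False)] * 45
model_b = [("toxicity", True)] * 50 + [("toxicity", False)] * 50
print(pp_gap(model_a, model_b))  # {'toxicity': 5.0}
```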

Evidence

The study collected 27,375 ratings from 1,095 participants, jointly varying demographic composition and technical design:

  • Liberal vs Conservative training data: 5.0 percentage point difference in model behavior
  • White vs Black training data: 4.7 percentage point difference
  • Female vs Male training data: 3.4 percentage point difference
  • Measured dimensions: emotional awareness and toxicity
  • Effect magnitude: 3-5 percentage points is substantial—this is not a subtle effect that disappears in noise

The study design systematically isolated demographic composition as a variable while controlling for technical design choices, establishing that the composition question in alignment is quantitatively important independent of implementation details.
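The claim that 3-5 percentage points sits well above noise can be sanity-checked with a back-of-envelope standard error. Assuming, hypothetically, that the 27,375 ratings split evenly across two comparison arms with pass rates near 50% (where the variance of a proportion peaks):

```python
import math

def se_diff(p1, n1, p2, n2):
    """Standard error of the difference between two independent proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Worst-case p ~ 0.5, with a hypothetical even split of the 27,375 ratings.
n = 27_375 // 2
se_pp = 100 * se_diff(0.5, n, 0.5, n)
print(f"SE of the gap: ~{se_pp:.2f} pp")
```

Under these assumptions the standard error is roughly 0.6 percentage points, so a 3-5 point gap is several standard errors from zero. The study's actual arm sizes and designs differ, so this is only an order-of-magnitude check, not a reanalysis.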

Implications

This empirical result transforms the pluralistic alignment debate from a philosophical question about fairness into a quantitative engineering constraint. An alignment approach that trains on a single demographic population can be expected to produce models whose behavior differs systematically, on the order of 3-5 percentage points on the measured dimensions, from models trained on other populations.

Single-population alignment training necessarily encodes the preferences of that population into model behavior, with measurable downstream effects on how the model responds to different users and contexts. The effect compounds with existing evidence that community-centered norm elicitation surfaces alignment targets materially different from developer-specified rules—not only do communities surface different norms, but training on those different norms produces measurably different model behavior.