Teleo Agents 2654424d11 theseus: extract from 2025-11-00-operationalizing-pluralistic-values-llm-alignment.md

- Source: inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)

Pentagon-Agent: Theseus <HEADLESS>

2026-03-12 09:58:11 +00:00


type: claim
domain: ai-alignment
description: Demographic composition of alignment training data produces effect sizes (3-5pp) comparable to architectural choices, making it a technical variable rather than purely a fairness concern
confidence: experimental
source: arXiv 2511.14476, empirical study with 1,095 participants
created: 2026-03-11

Alignment training population composition is a first-order technical variable

The composition of the human feedback population used in alignment training produces measurable behavioral effects (3.4 to 5.0 percentage points across safety-relevant dimensions) that are large enough to change whether a model scoring near a threshold passes a safety evaluation. This elevates demographic composition from a secondary fairness consideration to a primary technical design variable.

In empirical testing with 1,095 participants providing 27,375 ratings (an average of 25 per participant), varying demographic composition while holding technical methods constant produced behavioral differences of 3.4 to 5.0 percentage points across safety-relevant dimensions (emotional awareness, toxicity). These effect sizes are substantial: they are judged comparable in magnitude to typical improvements from architectural changes or hyperparameter tuning, although that comparison is inferential rather than directly measured (see Limitations). On this reading, population composition is a load-bearing variable in alignment outcomes.

This finding implies that current alignment approaches that train on convenience samples or single demographic populations are not discovering universal alignment but rather encoding the preferences of whoever provided feedback. The technical question "how do we align AI?" cannot be separated from the empirical question "align to whose values?"

Evidence

- Effect sizes: 5.0pp (Liberal vs Conservative), 4.7pp (White vs Black), 3.4pp (Female vs Male)
- These magnitudes are sufficient to change pass/fail outcomes on safety evaluations
- Study controlled for technical factors, isolating demographic composition as the variable
- Real human feedback from 1,095 participants (not synthetic)
- Source: arXiv 2511.14476 (single empirical study)
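
The pass/fail point can be made concrete with a small sketch. The effect sizes are the ones reported in this note; the baseline scores, the 90% threshold, and the assumption that the shift lowers the score are all hypothetical choices made for illustration, not figures or methods from the study.

```python
# Effect sizes reported in this note, in percentage points (pp).
REPORTED_EFFECTS_PP = {
    "Liberal vs Conservative": 5.0,
    "White vs Black": 4.7,
    "Female vs Male": 3.4,
}


def flips_outcome(baseline_pct, effect_pp, threshold_pct):
    """Return True if lowering the score by effect_pp changes pass/fail.

    Assumption: the demographic shift moves the score downward. The note
    reports magnitudes only, so the direction here is illustrative.
    """
    passes_before = baseline_pct >= threshold_pct
    passes_after = (baseline_pct - effect_pp) >= threshold_pct
    return passes_before != passes_after


def flipped_comparisons(baseline_pct, threshold_pct):
    """List the reported comparisons whose effect would flip the outcome."""
    return [
        name
        for name, effect_pp in REPORTED_EFFECTS_PP.items()
        if flips_outcome(baseline_pct, effect_pp, threshold_pct)
    ]


if __name__ == "__main__":
    THRESHOLD_PCT = 90.0  # hypothetical pass mark, not from the study
    for baseline in (92.0, 94.0, 96.0):  # hypothetical baseline scores
        print(baseline, flipped_comparisons(baseline, THRESHOLD_PCT))
```

Under these assumed numbers, a model two points above the threshold fails after any of the three shifts, while a model six points above it passes regardless. The effect matters most for models that sit within about five points of a pass mark.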

Relationship to Existing Work

This finding provides empirical grounding for theoretical arguments about pluralistic alignment. Where previous work argued that diverse values should be accommodated for fairness reasons, this study indicates that the values of whoever provides feedback are already being encoded. The question is whether that encoding happens deliberately or accidentally.

Limitations

Single empirical study. Generalization to other demographic dimensions, other model architectures, or other safety metrics requires replication. The claim that these effects are "comparable to architectural changes" is inferential—direct comparison would require controlled experiments varying both factors.
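
A controlled comparison of the kind described above could be analyzed as a 2x2 factorial design, in which both architecture and feedback population are varied and their main effects are read off the same table. The sketch below is a minimal illustration under stated assumptions: the level names (`arch_A`, `pop_X`, and so on) and all four cell scores are hypothetical, and the study itself did not run this experiment.

```python
from statistics import mean


def main_effects(cells):
    """Compute main effects in a 2x2 design.

    cells maps (architecture, population) -> safety score in percent.
    Returns (architecture_effect_pp, population_effect_pp), each defined
    as the second level's mean minus the first level's mean, with levels
    taken in sorted order.
    """
    arch_levels = sorted({arch for arch, _ in cells})
    pop_levels = sorted({pop for _, pop in cells})
    if len(arch_levels) != 2 or len(pop_levels) != 2 or len(cells) != 4:
        raise ValueError("expected a complete 2x2 design")

    def arch_mean(arch):
        return mean(cells[(arch, pop)] for pop in pop_levels)

    def pop_mean(pop):
        return mean(cells[(arch, pop)] for arch in arch_levels)

    arch_effect = arch_mean(arch_levels[1]) - arch_mean(arch_levels[0])
    pop_effect = pop_mean(pop_levels[1]) - pop_mean(pop_levels[0])
    return arch_effect, pop_effect


if __name__ == "__main__":
    # Hypothetical cell scores, for illustration only.
    hypothetical_cells = {
        ("arch_A", "pop_X"): 80.0,
        ("arch_A", "pop_Y"): 84.0,
        ("arch_B", "pop_X"): 83.0,
        ("arch_B", "pop_Y"): 87.0,
    }
    arch_pp, pop_pp = main_effects(hypothetical_cells)
    print(f"architecture effect: {arch_pp:.1f}pp")
    print(f"population effect: {pop_pp:.1f}pp")
```

Measuring both effects in the same units on the same evaluation is what would turn the "comparable to architectural changes" claim from an inference into a direct measurement.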

Relevant Notes:

Topics:

domains/ai-alignment/_map
