theseus: extract from 2025-11-00-operationalizing-pluralistic-values-llm-alignment.md

- Source: inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 3)

Pentagon-Agent: Theseus <HEADLESS>
This commit is contained in:
Teleo Agents 2026-03-12 06:42:49 +00:00
parent ba4ac4a73e
commit e609e80e3e
4 changed files with 68 additions and 1 deletion

View file

@@ -19,6 +19,12 @@ Since [[democratic alignment assemblies produce constitutions as effective as ex
Since [[collective intelligence requires diversity as a structural precondition not a moral preference]], community-centred norm elicitation is a concrete mechanism for ensuring the structural diversity that collective alignment requires. Without it, alignment defaults to the values of whichever demographic builds the systems.
### Additional Evidence (confirm)
*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
Large-scale empirical study (27,375 ratings, 1,095 participants) demonstrates that demographic composition of feedback providers produces 3-5 percentage point differences in model behavior on emotional awareness and toxicity metrics. Models trained on Liberal vs Conservative feedback differed by 5.0 pp, White vs Black by 4.7 pp, Female vs Male by 3.4 pp. This quantifies the claim that community composition materially affects alignment outcomes—the effect is not subtle or marginal.
---
Relevant Notes:

View file

@@ -0,0 +1,41 @@
---
type: claim
domain: ai-alignment
description: "Models trained on feedback from different demographic groups show 3-5 percentage point performance differences on emotional awareness and toxicity metrics, demonstrating that whose preferences are represented in alignment training materially affects model behavior."
confidence: likely
source: "arXiv 2511.14476"
created: 2026-03-11
---
# Demographic composition of alignment training data produces measurable behavioral differences in LLMs
The demographic makeup of human feedback providers materially affects aligned model behavior, with effect sizes of 3-5 percentage points across key safety dimensions. The effect is systematic and large enough to matter, not statistical noise.
A large-scale empirical study (27,375 ratings from 1,095 participants) jointly varied demographic composition and technical design in LLM alignment. Models fine-tuned on Liberal, White, and Female feedback showed improvements of 5.0, 4.7, and 3.4 percentage points respectively, relative to Conservative, Black, and Male baselines, measured across emotional awareness and toxicity dimensions.
This demonstrates that "whose feedback" matters as much as "how much feedback" for alignment outcomes. The composition of the training population materially affects model behavior in ways that cannot be dismissed as noise or marginal effects.
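A toy simulation (hypothetical numbers, not the study's data) illustrates why "whose feedback" cannot be fixed by "how much feedback": drawing more ratings from a single population converges on that population's preference rate, never on the pooled rate across populations.

```python
import random

random.seed(0)

# Hypothetical approval rates for some model behavior. Illustrative
# values only; the 5 pp gap mirrors the scale reported in the study.
POP_A_RATE = 0.70  # one demographic group
POP_B_RATE = 0.65  # another group
POOLED_RATE = (POP_A_RATE + POP_B_RATE) / 2  # 0.675

def mean_label(rate: float, n: int) -> float:
    """Average of n binary ratings drawn from a population with the given rate."""
    return sum(random.random() < rate for _ in range(n)) / n

for n in (100, 10_000, 100_000):
    est = mean_label(POP_A_RATE, n)
    print(f"n={n:>7}: single-population estimate = {est:.3f}")
# The estimate converges to 0.70, the sampled population's rate,
# not to the 0.675 pooled rate. More data sharpens a composition
# bias; it cannot correct one.
```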
The study is large for alignment research (N=1,095) and, critically, used real human feedback rather than synthetic data. This makes the findings more robust than typical small-N alignment studies.
## Evidence
- 27,375 ratings from 1,095 participants across systematically varied demographic compositions
- Liberal feedback vs Conservative baseline: +5.0 percentage points
- White feedback vs Black baseline: +4.7 percentage points
- Female feedback vs Male baseline: +3.4 percentage points
- Effects measured on emotional awareness and toxicity dimensions
- Real human feedback (not synthetic), making findings more robust than typical alignment studies
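The percentage-point comparisons above reduce to simple differences of group-level mean metric scores. A minimal sketch, using made-up per-condition scores chosen only to reproduce the reported gaps (the paper's 27,375 raw ratings are not reproduced here):

```python
# Hypothetical mean metric scores per training condition (illustrative
# values; only the pairwise gaps match the study's reported numbers).
scores = {
    "Liberal": 0.800, "Conservative": 0.750,
    "White": 0.790, "Black": 0.743,
    "Female": 0.770, "Male": 0.736,
}

def pp_diff(group: str, baseline: str) -> float:
    """Gap between two training conditions, in percentage points."""
    return round(100 * (scores[group] - scores[baseline]), 1)

print(pp_diff("Liberal", "Conservative"))  # 5.0
print(pp_diff("White", "Black"))           # 4.7
print(pp_diff("Female", "Male"))           # 3.4
```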
## Implications
This finding challenges the implicit assumption in much alignment work that feedback from any sufficiently large population will converge to similar outcomes. It provides empirical grounding for the claim that community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules, and confirms that some disagreements stem from genuine value differences rather than information gaps.
The magnitude of these effects (3-5 percentage points from demographic composition alone) suggests that single-population alignment training may be systematically biased in ways that technical improvements cannot address.
---
Related claims:
- community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
- some disagreements are permanently irreducible because they stem from genuine value differences not information gaps
- pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values

View file

@@ -19,6 +19,12 @@ This is distinct from the claim that since [[RLHF and DPO both fail at preferenc
Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate.
### Additional Evidence (extend)
*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
Provides quantitative evidence for the scale of value diversity that pluralistic alignment must accommodate. The 3-5 percentage point behavioral differences from demographic composition alone establish a lower bound on the divergence between different populations' alignment preferences. This is large enough to matter for real-world safety and fairness outcomes, making pluralistic approaches a practical necessity rather than just a philosophical preference.
---
Relevant Notes:

View file

@@ -7,9 +7,15 @@ date: 2025-11-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
status: processed
priority: high
tags: [pluralistic-alignment, demographic-composition, empirical, safety-inclusivity, real-human-feedback]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-llms.md"]
enrichments_applied: ["community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "First large-scale empirical study systematically varying demographic composition in alignment training. Provides quantitative evidence (3-5 pp effects) that 'whose feedback' matters as much as 'how much feedback'. Strong confirmation of existing pluralistic alignment claims with novel empirical grounding. Could not access full paper for interaction effects or comparison with PAL/MixDPO approaches."
---
## Content
@@ -37,3 +43,11 @@ Demonstrates that "whose feedback" matters as much as "how much feedback" for al
PRIMARY CONNECTION: community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
WHY ARCHIVED: Empirical evidence that "whose preferences" is a quantitatively important question, not just a fairness concern
EXTRACTION HINT: Focus on the magnitude of demographic composition effects and what this means for single-population alignment training
## Key Facts
- Study included 27,375 ratings from 1,095 participants
- Liberal vs Conservative feedback: 5.0 percentage point difference
- White vs Black feedback: 4.7 percentage point difference
- Female vs Male feedback: 3.4 percentage point difference
- Effects measured on emotional awareness and toxicity dimensions