teleo-codex/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
Teleo Agents 9fafd4eb38 auto-fix: strip 5 broken wiki links
Pipeline auto-fixer: removed [[ ]] brackets from links
that don't resolve to existing claims in the knowledge base.
2026-03-14 11:22:04 +00:00

53 lines
3.8 KiB
Markdown

---
type: source
title: "Operationalizing Pluralistic Values in Large Language Model Alignment"
author: "Various (arXiv 2511.14476)"
url: https://arxiv.org/pdf/2511.14476
date: 2025-11-01
domain: ai-alignment
secondary_domains: []
format: paper
status: processed
priority: high
tags: [pluralistic-alignment, demographic-composition, empirical, safety-inclusivity, real-human-feedback]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["demographic-composition-of-alignment-training-data-materially-affects-model-behavior-with-3-5-percentage-point-effects.md"]
enrichments_applied: ["community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md", "some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Single high-quality claim extracted with strong empirical grounding (N=1,095, 27,375 ratings). Three enrichments to existing pluralistic alignment claims. Could not access full paper—extraction based on search summary and agent notes. Full paper would likely contain additional insights on interaction effects and comparison with other pluralistic alignment approaches."
---
## Content
Systematic empirical study of LLM alignment with real human feedback: 27,375 ratings from 1,095 participants.
**Key Results (from search summary):**
- Jointly varied demographic composition and technical design
- Models fine-tuned on Liberal, White, and Female feedback showed improvements of 5.0, 4.7, and 3.4 percentage points respectively
- Relative to Conservative, Black, and Male baselines
- Measured across emotional awareness and toxicity dimensions
**Key Contribution:**
Demonstrates that "whose feedback" matters as much as "how much feedback" for alignment outcomes. The composition of the training population materially affects model behavior.
## Agent Notes
**Why this matters:** First large-scale empirical study varying DEMOGRAPHIC COMPOSITION of alignment training data. Proves that the composition question (whose preferences?) has measurable, quantitative effects on model behavior.
**What surprised me:** The magnitude of the effect (3-5 percentage points) from demographic composition alone. This is not a subtle effect.
**What I expected but didn't find:** Couldn't access full paper. Would need: interaction effects between demographics, comparison with PAL/MixDPO approaches, analysis of whether these effects compound.
**KB connections:** Directly supports [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]. Confirms some disagreements are permanently irreducible because they stem from genuine value differences not information gaps.
**Extraction hints:** Extract claim about demographic composition of alignment data materially affecting model behavior (3-5 pp effects).
**Context:** 1,095 participants is a large N for alignment research. Real human feedback, not synthetic.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
WHY ARCHIVED: Empirical evidence that "whose preferences" is a quantitatively important question, not just a fairness concern
EXTRACTION HINT: Focus on the magnitude of demographic composition effects and what this means for single-population alignment training
## Key Facts
- Study included 1,095 participants providing 27,375 ratings
- Liberal training data: +5.0 pp vs Conservative baseline
- White training data: +4.7 pp vs Black baseline
- Female training data: +3.4 pp vs Male baseline
- Effects measured on emotional awareness and toxicity dimensions