---
type: source
title: "Operationalizing Pluralistic Values in Large Language Model Alignment"
author: "Various (arXiv 2511.14476)"
url: https://arxiv.org/pdf/2511.14476
date: 2025-11-01
domain: ai-alignment
secondary_domains: []
format: paper
status: enrichment
priority: high
tags: [pluralistic-alignment, demographic-composition, empirical, safety-inclusivity, real-human-feedback]
processed_by: theseus
processed_date: 2026-03-15
enrichments_applied: ["community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md", "single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

Systematic empirical study of LLM alignment with real human feedback: 27,375 ratings from 1,095 participants.

**Key Results (from search summary):**
- Jointly varied demographic composition and technical design
- Models fine-tuned on Liberal, White, and Female feedback showed improvements of 5.0, 4.7, and 3.4 percentage points, respectively, relative to Conservative, Black, and Male baselines
- Effects were measured across emotional awareness and toxicity dimensions

**Key Contribution:**
Demonstrates that "whose feedback" matters as much as "how much feedback" for alignment outcomes. The composition of the training population materially affects model behavior.

## Agent Notes
**Why this matters:** First large-scale empirical study varying DEMOGRAPHIC COMPOSITION of alignment training data. Shows that the composition question (whose preferences?) has measurable, quantitative effects on model behavior.
**What surprised me:** The magnitude of the effect (3.4 to 5.0 percentage points) from demographic composition alone. This is not a subtle effect.
**What I expected but didn't find:** Could not access the full paper. Still needed: interaction effects between demographics, comparison with PAL/MixDPO approaches, and analysis of whether these effects compound.
**KB connections:** Directly supports [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]. Confirms that some disagreements are permanently irreducible because they stem from genuine value differences, not information gaps.
**Extraction hints:** Extract the claim that the demographic composition of alignment data materially affects model behavior (3.4 to 5.0 pp effects).
**Context:** 1,095 participants is a large N for alignment research. Real human feedback, not synthetic.

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
WHY ARCHIVED: Empirical evidence that "whose preferences" is a quantitatively important question, not just a fairness concern
EXTRACTION HINT: Focus on the magnitude of demographic composition effects and what this means for single-population alignment training

## Key Facts
- Study included 27,375 ratings from 1,095 participants
- Models fine-tuned on Liberal feedback showed 5.0 percentage point improvement over Conservative baseline
- Models fine-tuned on White feedback showed 4.7 percentage point improvement over Black baseline
- Models fine-tuned on Female feedback showed 3.4 percentage point improvement over Male baseline
- Effects measured across emotional awareness and toxicity dimensions
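
The figures listed above can be sanity-checked with a few lines of Python. This is a minimal sketch, not code from the paper: the constant and function names (`REPORTED_EFFECTS_PP`, `ratings_per_participant`, `relative_change`) are illustrative assumptions, and only the numbers come from the search summary. The `relative_change` helper uses a hypothetical baseline score, which the summary does not report, purely to show that a percentage-point gain is not the same thing as a relative change.

```python
# Figures as reported in the search summary; the layout is illustrative only.
TOTAL_RATINGS = 27_375
TOTAL_PARTICIPANTS = 1_095

# (feedback group, baseline group) -> reported improvement in percentage points
REPORTED_EFFECTS_PP = {
    ("Liberal", "Conservative"): 5.0,
    ("White", "Black"): 4.7,
    ("Female", "Male"): 3.4,
}


def ratings_per_participant(ratings: int, participants: int) -> float:
    """Average number of ratings contributed per participant."""
    return ratings / participants


def effect_range(effects: dict) -> tuple:
    """Smallest and largest reported effect, in percentage points."""
    values = list(effects.values())
    return min(values), max(values)


def relative_change(baseline_score_pct: float, gain_pp: float) -> float:
    """Relative change (as a fraction) implied by a percentage-point gain.

    baseline_score_pct is HYPOTHETICAL: the summary gives no baseline scores.
    """
    return gain_pp / baseline_score_pct


if __name__ == "__main__":
    print(ratings_per_participant(TOTAL_RATINGS, TOTAL_PARTICIPANTS))
    print(effect_range(REPORTED_EFFECTS_PP))
    # With an assumed (not reported) baseline of 50%, a 5.0 pp gain
    # would correspond to a 10% relative change.
    print(relative_change(50.0, 5.0))
```

On these numbers the average works out to 25 ratings per participant. Whether every participant rated exactly 25 items is not stated in the summary.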