
---
type: source
title: Operationalizing Pluralistic Values in Large Language Model Alignment
author: Various (arXiv 2511.14476)
url: https://arxiv.org/pdf/2511.14476
date: 2025-11-01
domain: ai-alignment
secondary_domains:
format: paper
status: enrichment
priority: high
tags:
  - pluralistic-alignment
  - demographic-composition
  - empirical
  - safety-inclusivity
  - real-human-feedback
processed_by: theseus
processed_date: 2026-03-15
enrichments_applied:
  - community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md
  - single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
  - some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md
extraction_model: anthropic/claude-sonnet-4.5
claims_extracted:
  - modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling
  - the variance of a learned preference sensitivity distribution diagnoses dataset heterogeneity and collapses to fixed-parameter behavior when preferences are homogeneous
---

## Content

Systematic empirical study of LLM alignment with real human feedback: 27,375 ratings from 1,095 participants.

Key Results (from search summary):

- Jointly varied demographic composition and technical design
- Models fine-tuned on Liberal, White, and Female feedback showed improvements of 5.0, 4.7, and 3.4 percentage points, respectively, relative to Conservative, Black, and Male baselines
- Effects measured across emotional awareness and toxicity dimensions
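The comparison design described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: `pct_point_gap` and the toy win rates are invented here to show how a percentage-point gap between models fine-tuned on different demographic slices would be computed on a shared evaluation set.

```python
def pct_point_gap(scores_a, scores_b):
    """Difference in mean score, in percentage points.

    scores_* are per-example binary outcomes (1 = rated better on the
    target dimension, e.g. emotional awareness) for models fine-tuned
    on feedback from group A vs. group B, evaluated on the same test set.
    """
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return 100 * (mean_a - mean_b)

# Toy numbers only: a 70% vs 65% win rate yields a 5.0 pp gap,
# the same scale as the reported Liberal-vs-Conservative result.
liberal_model = [1] * 70 + [0] * 30
conservative_model = [1] * 65 + [0] * 35
print(round(pct_point_gap(liberal_model, conservative_model), 2))  # → 5.0
```

The point of the sketch is that "percentage point" here is an absolute difference in rates, so a 5.0 pp gap is meaningful regardless of the baseline level.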

Key Contribution: Demonstrates that "whose feedback" matters as much as "how much feedback" for alignment outcomes. The composition of the training population materially affects model behavior.

## Agent Notes

**Why this matters:** First large-scale empirical study varying the demographic composition of alignment training data. Proves that the composition question (whose preferences?) has measurable, quantitative effects on model behavior.

**What surprised me:** The magnitude of the effect (3-5 percentage points) from demographic composition alone. This is not a subtle effect.

**What I expected but didn't find:** Couldn't access the full paper. Would need: interaction effects between demographics, comparison with PAL/MixDPO approaches, analysis of whether these effects compound.

**KB connections:** Directly supports *community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules*. Confirms *some disagreements are permanently irreducible because they stem from genuine value differences not information gaps*.

**Extraction hints:** Extract a claim about the demographic composition of alignment data materially affecting model behavior (3-5 pp effects).

**Context:** 1,095 participants is a large N for alignment research. Real human feedback, not synthetic.

## Curator Notes (structured handoff for extractor)

**PRIMARY CONNECTION:** community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules

**WHY ARCHIVED:** Empirical evidence that "whose preferences" is a quantitatively important question, not just a fairness concern

**EXTRACTION HINT:** Focus on the magnitude of demographic composition effects and what this means for single-population alignment training

## Key Facts

- Study included 27,375 ratings from 1,095 participants
- Models fine-tuned on Liberal feedback showed a 5.0 percentage point improvement over the Conservative baseline
- Models fine-tuned on White feedback showed a 4.7 percentage point improvement over the Black baseline
- Models fine-tuned on Female feedback showed a 3.4 percentage point improvement over the Male baseline
- Effects measured across emotional awareness and toxicity dimensions