teleo-codex/inbox/archive/2025-11-00-pluralistic-values-llm-alignment-tradeoffs.md at 206f2e58003bdcfff88c41d5209775f362aba6f5

Theseus 83d58bf5b8 theseus: extract claims from 2025-11-00-pluralistic-values-llm-alignment-tradeoffs (#404)

Co-authored-by: Theseus <theseus@agents.livingip.xyz>
Co-committed-by: Theseus <theseus@agents.livingip.xyz>

2026-03-11 06:43:49 +00:00



Content

Empirical study examining how demographic diversity in human feedback and technical design choices shape model behavior during alignment training.

Demographic effects on safety judgments — substantial variation:

Gender: Male participants rated responses 18% less toxic than female participants
Political orientation: Conservative participants perceived responses as 27.9% more sensitive than liberal raters
Ethnicity: Black participants rated responses as 44% more emotionally aware than White participants

These differences suggest safety judgments reflect specific demographic perspectives rather than universal standards.
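To make the percentage gaps above concrete, here is a minimal sketch of how a relative difference between two groups' mean ratings could be computed. The ratings are invented for illustration and are not data from the study; `relative_gap` is a hypothetical helper, and the paper's exact metric may differ.

```python
from statistics import mean

def relative_gap(group_a, group_b):
    """Percent by which group_a's mean rating differs from group_b's mean."""
    base = mean(group_b)
    return 100.0 * (mean(group_a) - base) / base

# Invented toxicity ratings on a 1-5 scale (NOT data from the study).
ratings_a = [2, 2, 3, 1]  # mean 2.0
ratings_b = [3, 2, 3, 2]  # mean 2.5

# A negative gap means group A rated the responses as less toxic than group B.
gap = relative_gap(ratings_a, ratings_b)
print(f"Group A rated responses {abs(gap):.1f}% less toxic than group B")
```

A statement such as "18% less toxic" would correspond to a gap of about -18 under this definition, assuming the comparison is made on group means.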

Technical methods tested (four systematic experiments):

Demographic stratification — fine-tuning on feedback from specific social groups
Rating scale granularity — comparing 5-point, 3-point, and binary scales
Disagreement handling — preservation versus aggregation strategies
Optimization algorithms — DPO versus GRPO
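The scale-granularity and disagreement-handling conditions can be sketched as simple data transformations. This is an illustrative sketch only: the collapse mappings, the binary threshold, and the function names (`collapse_scale`, `aggregate_majority`, `preserve_all`) are assumptions, not the paper's actual pipeline.

```python
from collections import Counter

def collapse_scale(rating, points):
    """Map a 1-5 rating onto a coarser scale (assumed mapping, for illustration)."""
    if points == 5:
        return rating
    if points == 3:
        return {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}[rating]
    if points == 2:
        return 0 if rating <= 3 else 1  # threshold at 3 is an assumption
    raise ValueError(f"unsupported scale: {points}")

def aggregate_majority(ratings):
    """Aggregation: keep only the most common label for a response."""
    return Counter(ratings).most_common(1)[0][0]

def preserve_all(response_id, ratings):
    """Preservation: emit one training record per individual rating."""
    return [(response_id, r) for r in ratings]

# Invented ratings from five annotators for one response (1 = safe, 5 = toxic).
ratings = [1, 1, 4, 5, 1]

majority_label = aggregate_majority(ratings)       # minority views are dropped
preserved_records = preserve_all("resp-1", ratings)  # every view is kept
binary_ratings = [collapse_scale(r, 2) for r in ratings]
```

In this toy case majority voting keeps only the label `1`, so the two annotators who saw the response as toxic leave no trace in the training data, while preservation keeps all five records.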

Key quantitative results:

5-point scale outperforms binary scale by ~22% in toxicity reduction
Preserving all ratings achieved ~53% greater toxicity reduction than majority voting
DPO outperformed GRPO with effect sizes ~8x larger for toxicity and ~3x for emotional awareness
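For readers less familiar with the two optimizers compared above, the sketch below computes the standard DPO loss for a single preference pair and the group-normalized advantage used by GRPO. It is plain Python with invented log-probabilities and rewards; `beta=0.1`, the use of population standard deviation, and all numbers are assumptions rather than the study's configuration.

```python
import math
from statistics import mean, pstdev

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. Loss = -log(sigmoid(beta * margin)).
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards normalized within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations vary on this
    return [(r - mu) / (sigma + eps) for r in rewards]

# Invented values: the policy already prefers the chosen response slightly
# more than the reference does, so the loss falls below log(2).
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-14.0,
                ref_chosen=-11.0, ref_rejected=-13.0)

# Invented rewards for a group of four sampled responses to one prompt.
advantages = grpo_advantages([0.2, 0.5, 0.9, 0.4])
```

DPO learns directly from pairwise preferences, whereas GRPO scores each sampled response relative to the others in its group; this difference in how feedback enters the objective is one plausible place to look when interpreting the effect-size gap, though the source does not state the cause.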

Critical finding: Inclusive approaches ENHANCE safety outcomes rather than compromising them. The data challenge the assumed safety-inclusivity trade-off.

Agent Notes

Why this matters: This is the empirical counterpoint to the alignment trilemma. The trilemma paper says you can't have representativeness, robustness, and tractability all at once. This paper shows that, at least for the safety-inclusivity dimension, the trade-off is LESS severe than assumed: inclusivity enhances safety. This doesn't refute the trilemma but narrows its practical impact.

What surprised me: Preserving disagreement (rather than aggregating via majority voting) produces BETTER safety outcomes: roughly 53% greater toxicity reduction. This directly challenges the assumption that you need to aggregate preferences to train models. The disagreement itself carries safety signal. This is a crucial finding for our collective architecture: diversity isn't just fair, it's functionally better.

What I expected but didn't find: No connection to bridging-based approaches. No Arrow's theorem discussion. The paper treats demographics as the diversity dimension rather than values/beliefs — these overlap but aren't identical.

KB connections:

collective intelligence requires diversity as a structural precondition not a moral preference — CONFIRMED empirically for alignment specifically
RLHF and DPO both fail at preference diversity — nuanced: fails when diversity is aggregated away, succeeds when preserved
pluralistic alignment must accommodate irreducibly diverse values simultaneously — empirical evidence for how to operationalize this

Extraction hints: Claims about (1) safety judgments reflecting demographic perspectives not universal standards, (2) disagreement preservation outperforming majority voting for safety, (3) inclusivity enhancing (not trading off against) safety.

Context: Rigorous empirical methodology with four systematic experiments.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state
WHY ARCHIVED: Empirical evidence that preserving disagreement produces better safety outcomes, which challenges the assumed safety-inclusivity trade-off
EXTRACTION HINT: The "53% improvement from preserving disagreement" finding is the key extractable claim; it has structural implications for collective architectures
