extract: 2025-11-00-operationalizing-pluralistic-values-llm-alignment
Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
parent 116603acd9
commit d48d2e2c7b
5 changed files with 55 additions and 1 deletions
@@ -19,6 +19,12 @@ Since [[democratic alignment assemblies produce constitutions as effective as ex
Since [[collective intelligence requires diversity as a structural precondition not a moral preference]], community-centred norm elicitation is a concrete mechanism for ensuring the structural diversity that collective alignment requires. Without it, alignment defaults to the values of whichever demographic builds the systems.

### Additional Evidence (confirm)
*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-15*

An empirical study with 27,375 ratings from 1,095 participants shows that the demographic composition of training data produces differences of 3.4 to 5.0 percentage points in model behavior across emotional awareness and toxicity dimensions. This quantifies the magnitude of difference between community-sourced and developer-specified alignment targets.

---

Relevant Notes:
@@ -27,6 +27,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
- GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
- Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio

### Additional Evidence (confirm)
*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-15*

The study demonstrates that models trained on feedback from different demographic populations show measurable behavioral divergence (3.4 to 5.0 percentage points), providing empirical evidence that a single reward function trained on one population systematically misaligns with others.

---

Relevant Notes:
@@ -21,6 +21,12 @@ The correct response is to map the disagreement rather than eliminate it. Identi
[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.

### Additional Evidence (confirm)
*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-15*

Systematic variation of demographic composition in alignment training produced persistent behavioral differences across Liberal/Conservative, White/Black, and Female/Male populations, suggesting these reflect genuine value differences rather than information asymmetries that could be resolved.

---

Relevant Notes:
@@ -0,0 +1,24 @@
{
  "rejected_claims": [
    {
      "filename": "demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-model-outputs.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 1,
    "kept": 0,
    "fixed": 1,
    "rejected": 1,
    "fixes_applied": [
      "demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-model-outputs.md:set_created:2026-03-15"
    ],
    "rejections": [
      "demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-model-outputs.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-15"
}
@@ -7,9 +7,13 @@ date: 2025-11-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
status: enrichment
priority: high
tags: [pluralistic-alignment, demographic-composition, empirical, safety-inclusivity, real-human-feedback]
processed_by: theseus
processed_date: 2026-03-15
enrichments_applied: ["community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md", "single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content
@@ -37,3 +41,11 @@ Demonstrates that "whose feedback" matters as much as "how much feedback" for al
PRIMARY CONNECTION: community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
WHY ARCHIVED: Empirical evidence that "whose preferences" is a quantitatively important question, not just a fairness concern
EXTRACTION HINT: Focus on the magnitude of demographic composition effects and what this means for single-population alignment training

## Key Facts
- Study included 27,375 ratings from 1,095 participants
- Models fine-tuned on Liberal feedback showed 5.0 percentage point improvement over Conservative baseline
- Models fine-tuned on White feedback showed 4.7 percentage point improvement over Black baseline
- Models fine-tuned on Female feedback showed 3.4 percentage point improvement over Male baseline
- Effects measured across emotional awareness and toxicity dimensions
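
As a rough sanity check on the figures listed above, the following Python sketch tabulates the three reported effect sizes and confirms that they fall in the 3-5 percentage point range used in the enrichment notes. The names `effects_pp` and `summarize` are hypothetical, and the unweighted mean is an illustrative summary only, not a statistic reported by the study.

```python
# Effect sizes in percentage points, copied from the Key Facts list.
# Keys are (fine-tuning population, baseline population).
effects_pp = {
    ("Liberal", "Conservative"): 5.0,
    ("White", "Black"): 4.7,
    ("Female", "Male"): 3.4,
}


def summarize(effects):
    """Return (min, max, unweighted mean) of the effect sizes.

    The unweighted mean is an assumption made for illustration;
    the source does not report an aggregate effect.
    """
    values = list(effects.values())
    return min(values), max(values), sum(values) / len(values)


low, high, mean = summarize(effects_pp)

# Ratings and participant counts, also from the Key Facts list.
ratings_per_participant = 27_375 / 1_095

print(f"range: {low}-{high} pp, unweighted mean: {mean:.2f} pp")
print(f"ratings per participant: {ratings_per_participant:.1f}")
```

Under these assumptions the smallest and largest effects are 3.4 and 5.0 percentage points, and the rating counts work out to 25 ratings per participant on average.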