| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | Collective Constitutional AI: Aligning a Language Model with Public Input | Anthropic, CIP | https://www.anthropic.com/research/collective-constitutional-ai-aligning-a-language-model-with-public-input | 2023-10-01 | ai-alignment | | paper | unprocessed | medium | |
Content
Anthropic and CIP collaborated on one of the first instances where members of the public collectively directed the behavior of a language model via an online deliberation process.
Methodology: Multi-stage process:
- Source public preferences into a "constitution" using the Polis platform
- Fine-tune a language model to adhere to this constitution using Constitutional AI
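The first stage can be pictured as a filter over Polis-style votes. The sketch below is illustrative only: this note does not record how statements were selected for the constitution, so the selection rule (keep a statement when every opinion group agrees with it at or above a threshold), the function names `group_agreement` and `select_consensus`, the vote encoding, and the 0.7 threshold are all assumptions rather than the method Anthropic and CIP used.

```python
from collections import defaultdict


def group_agreement(votes):
    """Return the agreement rate per (statement, group).

    `votes` is an iterable of (statement_id, group_id, vote) tuples, where
    vote is +1 (agree), -1 (disagree), or 0 (pass). This encoding is a
    hypothetical one chosen for the sketch. Passes count in the denominator.
    """
    counts = defaultdict(lambda: [0, 0])  # (statement, group) -> [agrees, total]
    for statement_id, group_id, vote in votes:
        tally = counts[(statement_id, group_id)]
        if vote == 1:
            tally[0] += 1
        tally[1] += 1
    return {key: agrees / total for key, (agrees, total) in counts.items()}


def select_consensus(votes, threshold=0.7):
    """Keep statements that every opinion group agrees with at >= threshold.

    Assumed rule for illustration; the threshold value is arbitrary.
    """
    rates = group_agreement(votes)
    groups = {group_id for _, group_id in rates}
    by_statement = defaultdict(dict)
    for (statement_id, group_id), rate in rates.items():
        by_statement[statement_id][group_id] = rate
    selected = [
        statement_id
        for statement_id, per_group in by_statement.items()
        # A statement must have votes from every group and clear the bar in each.
        if set(per_group) == groups and min(per_group.values()) >= threshold
    ]
    return sorted(selected)


# Toy data: two opinion groups (A, B) voting on three statements.
sample_votes = [
    ("s1", "A", 1), ("s1", "A", 1), ("s1", "B", 1), ("s1", "B", 1),
    ("s2", "A", 1), ("s2", "A", 1), ("s2", "B", -1), ("s2", "B", -1),
    ("s3", "A", 1), ("s3", "B", 0), ("s3", "B", 1),
]
consensus = select_consensus(sample_votes)
```

With the toy data, only `s1` clears the 0.7 bar in both groups. Taking the minimum across groups rather than the overall agreement rate is one way to stop a large group from outvoting a smaller one, which matters here because Polis identified two separate opinion groups.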
Scale: ~1,000 U.S. adults (representative sample across age, gender, income, geography). 1,127 statements contributed to Polis. 38,252 votes cast (average 34 votes/person).
Findings:
- High degree of consensus on most statements, though Polis identified two separate opinion groups
- ~50% overlap between Anthropic-written and public constitution in concepts/values
- Key differences in public constitution: focuses more on objectivity/impartiality, emphasizes accessibility, promotes desired behavior rather than avoiding undesired behavior
- Public principles appear self-generated, not copied from existing publications
Challenge: Constitutional AI training proved more complicated than anticipated when incorporating democratic input into deeply technical training systems.
Agent Notes
Why this matters: This is among the first real-world applications of democratic alignment at a frontier lab. The roughly 50% divergence between the expert-designed and public constitutions supports our claim that democratic input surfaces materially different alignment targets. But the training difficulties suggest that the gap between democratic input and technical implementation is real.
What surprised me: The public constitution promotes DESIRED behavior rather than avoiding undesired behavior, a fundamentally different orientation from expert-designed constitutions, which focus on harm avoidance. This is an important asymmetry.
What I expected but didn't find: Follow-up results. Did the publicly constituted model perform differently? Was it more or less safe? The experiment was run, but I did not find an outcome evaluation in the public materials.
KB connections:
- democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations: supported on representation; effectiveness remains unverified here because no outcome evaluation was found
- community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules: supported by the ~50% divergence
Extraction hints: Already covered by existing KB claims. Value is as supporting evidence, not new claims.
Context: 2023 — relatively early for democratic alignment work. Sets precedent for CIP's subsequent work.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations
WHY ARCHIVED: Foundational empirical evidence for democratic alignment; supports existing claims with Anthropic deployment data
EXTRACTION HINT: The "desired behavior vs harm avoidance" asymmetry between public and expert constitutions could be a novel claim