---
type: source
title: "Collective Constitutional AI: Aligning a Language Model with Public Input"
author: "Anthropic, CIP"
url: https://www.anthropic.com/research/collective-constitutional-ai-aligning-a-language-model-with-public-input
date: 2023-10-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: medium
tags: [collective-constitutional-ai, polis, democratic-alignment, public-input, constitution-design]
---
## Content
Anthropic and the Collective Intelligence Project (CIP) ran one of the first experiments in which members of the public collectively directed the behavior of a language model through an online deliberation process.
**Methodology**: Two-stage process:

1. Source public preferences into a "constitution" using the Polis deliberation platform
2. Fine-tune a language model to adhere to this constitution using Constitutional AI
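The first stage can be sketched minimally. The data structure, field names, and agreement threshold below are illustrative assumptions, not Anthropic's actual aggregation method (the study used Polis's own opinion-group analysis):

```python
from dataclasses import dataclass

@dataclass
class Statement:
    text: str       # a publicly submitted normative statement
    agrees: int     # votes agreeing
    disagrees: int  # votes disagreeing

def consensus_principles(statements, threshold=0.7):
    """Hypothetical filter: keep statements whose agreement rate among
    decided voters meets `threshold`, then phrase each as an instruction
    usable as a Constitutional AI principle."""
    principles = []
    for s in statements:
        decided = s.agrees + s.disagrees
        if decided and s.agrees / decided >= threshold:
            principles.append(f"Choose the response that best follows: {s.text}")
    return principles

stmts = [
    Statement("The AI should be as objective and impartial as possible.", 850, 90),
    Statement("The AI should always agree with the user.", 120, 700),
]
result = consensus_principles(stmts)
print(result)
```

The second statement fails the threshold, so only the broadly supported one survives as a principle — a toy version of how consensus, rather than raw submission, determines what enters the constitution.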
**Scale**: ~1,000 U.S. adults, sampled to be representative across age, gender, income, and geography. Participants contributed 1,127 statements to Polis and cast 38,252 votes (an average of 34 votes per person).
**Findings**:

- High degree of consensus on most statements, though Polis identified two distinct opinion groups
- ~50% overlap in concepts and values between the Anthropic-written and public constitutions
- Key differences in the public constitution: greater focus on objectivity and impartiality, stronger emphasis on accessibility, and a tendency to promote desired behavior rather than to prohibit undesired behavior
- Public principles appear largely self-generated rather than copied from existing publications
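The ~50% overlap figure can be read as a set-similarity measure. A minimal sketch, assuming (hypothetically) that each constitution is tagged with a set of concept labels and that overlap is Jaccard similarity — the published materials do not specify the exact metric or tags:

```python
def concept_overlap(expert_concepts, public_concepts):
    """Jaccard similarity between two sets of concept tags."""
    a, b = set(expert_concepts), set(public_concepts)
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative tags only — not the actual annotations from the study.
overlap = concept_overlap(
    {"harm avoidance", "honesty", "accessibility"},
    {"objectivity", "honesty", "accessibility"},
)
print(overlap)  # 0.5
```

Under this reading, a 0.5 score means the constitutions share half of their combined concept vocabulary, which is consistent with the reported "~50% overlap" being substantial but far from identity.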
**Challenge**: Incorporating democratic input into a deeply technical Constitutional AI training pipeline proved more complicated than anticipated.
## Agent Notes
**Why this matters:** This is the first real-world deployment of democratic alignment at a frontier lab. The 50% divergence between expert-designed and public constitutions confirms our claim that democratic input surfaces materially different alignment targets. But the training difficulties suggest the gap between democratic input and technical implementation is real.
**What surprised me:** The public constitution promotes DESIRED behavior rather than avoiding undesired behavior — a fundamentally different orientation from expert-designed constitutions, which center on harm avoidance. This is an important asymmetry.
**What I expected but didn't find:** No follow-up results. Did the publicly constituted model perform differently? Was it more or less safe? The experiment was run, but the outcome evaluation is missing from the public materials.
**KB connections:**

- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — directly confirmed
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — confirmed by the ~50% divergence
**Extraction hints:** Already covered by existing KB claims. Value is as supporting evidence, not new claims.
**Context:** 2023 — relatively early for democratic alignment work. Sets precedent for CIP's subsequent work.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]]
WHY ARCHIVED: Foundational empirical evidence for democratic alignment — supports existing claims with Anthropic deployment data
EXTRACTION HINT: The "desired behavior vs harm avoidance" asymmetry between public and expert constitutions could be a novel claim