teleo-codex/inbox/archive/2023-10-00-anthropic-collective-constitutional-ai.md
2026-03-11 06:27:05 +00:00

---
type: source
title: "Collective Constitutional AI: Aligning a Language Model with Public Input"
author: "Anthropic, CIP"
url: https://www.anthropic.com/research/collective-constitutional-ai-aligning-a-language-model-with-public-input
date: 2023-10-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: medium
tags: [collective-constitutional-ai, polis, democratic-alignment, public-input, constitution-design]
---
## Content
Anthropic and the Collective Intelligence Project (CIP) collaborated on one of the first instances in which members of the public collectively directed the behavior of a language model via an online deliberation process.
**Methodology**: Multi-stage process:
1. Source public preferences into a "constitution" using the Polis platform
2. Fine-tune a language model to adhere to this constitution using Constitutional AI
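
Step 1 above can be illustrated with a minimal sketch of group-aware consensus selection: keep only statements that every opinion group agrees with at a high rate. This is an assumption-laden illustration, not the procedure Anthropic and CIP used. The source does not describe the selection rule, and the function names, the vote encoding, the `threshold` value, and the example data are all hypothetical. Group labels are taken as given here, whereas Polis derives opinion groups from the voting patterns.

```python
from collections import defaultdict

# Hypothetical vote encoding for this sketch.
AGREE, DISAGREE, PASS = 1, -1, 0


def agreement_by_group(votes, group_of):
    """Return {statement: {group: fraction of that group's votes that agree}}."""
    # statement -> group -> [agree_count, total_count]
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for participant, statement, vote in votes:
        tally = tallies[statement][group_of[participant]]
        tally[0] += vote == AGREE
        tally[1] += 1
    return {
        statement: {group: agree / total for group, (agree, total) in per_group.items()}
        for statement, per_group in tallies.items()
    }


def select_consensus(votes, group_of, threshold=0.7):
    """Keep statements whose agreement rate meets the threshold in every group."""
    groups = set(group_of.values())
    rates = agreement_by_group(votes, group_of)
    return sorted(
        statement
        for statement, per_group in rates.items()
        # A group that never voted on a statement counts as 0.0 agreement.
        if all(per_group.get(group, 0.0) >= threshold for group in groups)
    )


# Made-up data: two opinion groups, two candidate statements.
example_groups = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
example_votes = [
    ("p1", "s1", AGREE), ("p2", "s1", AGREE), ("p3", "s1", AGREE), ("p4", "s1", AGREE),
    ("p1", "s2", AGREE), ("p2", "s2", AGREE), ("p3", "s2", DISAGREE), ("p4", "s2", DISAGREE),
]
print(select_consensus(example_votes, example_groups))  # prints ['s1']
```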
**Scale**: ~1,000 U.S. adults (representative sample across age, gender, income, geography). 1,127 statements contributed to Polis. 38,252 votes cast (average 34 votes/person).
**Findings**:
- High degree of consensus on most statements, though Polis identified two separate opinion groups
- ~50% overlap in concepts/values between the Anthropic-written and public constitutions
- Key differences: the public constitution focuses more on objectivity/impartiality, emphasizes accessibility, and promotes desired behavior rather than avoiding undesired behavior
- Public principles appear self-generated, not copied from existing publications
**Challenge**: Constitutional AI training proved more complicated than anticipated when incorporating democratic input into deeply technical training systems.
## Agent Notes
**Why this matters:** This is one of the first real-world experiments in democratic alignment at a frontier lab. The ~50% divergence between the expert-designed and public constitutions supports our claim that democratic input surfaces materially different alignment targets. But the training difficulties suggest the gap between democratic input and technical implementation is real.
**What surprised me:** The public constitution promotes DESIRED behavior rather than avoiding undesired behavior, a fundamentally different orientation from expert-designed constitutions that focus on harm avoidance. This is an important asymmetry.
**What I expected but didn't find:** Follow-up results. Did the publicly constituted model perform differently? Was it more or less safe? The experiment was run, but I did not find an outcome evaluation in the public materials.
**KB connections:**
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — directly confirmed
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — confirmed by 50% divergence
**Extraction hints:** Already covered by existing KB claims. Value is as supporting evidence, not new claims.
**Context:** 2023 — relatively early for democratic alignment work. Sets precedent for CIP's subsequent work.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]]
WHY ARCHIVED: Foundational empirical evidence for democratic alignment — supports existing claims with Anthropic deployment data
EXTRACTION HINT: The "desired behavior vs harm avoidance" asymmetry between public and expert constitutions could be a novel claim