teleo-codex/inbox/archive/2023-10-00-anthropic-collective-constitutional-ai.md

type: source
title: "Collective Constitutional AI: Aligning a Language Model with Public Input"
author: "Anthropic, CIP"
url: https://www.anthropic.com/research/collective-constitutional-ai-aligning-a-language-model-with-public-input
date: 2023-10-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: medium
tags: [collective-constitutional-ai, polis, democratic-alignment, public-input, constitution-design]

Content

Anthropic and the Collective Intelligence Project (CIP) collaborated on one of the first instances in which members of the public collectively directed the behavior of a language model through an online deliberation process.

Methodology (multi-stage process):

  1. Source public preferences into a "constitution" using the Polis platform
  2. Fine-tune a language model to adhere to this constitution using Constitutional AI
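
To make step 1 concrete, here is a minimal sketch of how statements with cross-group support could be picked out of a vote table. It is a simplified stand-in, not Polis's actual algorithm and not the procedure Anthropic and CIP used: the vote encoding, the Laplace smoothing, the `0.6` threshold, the group labels, and the statement IDs are all assumptions made for illustration. Step 2 (Constitutional AI fine-tuning on the selected statements) is not shown.

```python
from typing import Dict, List, Optional

# Vote encoding (an assumption for this sketch): 1 = agree, -1 = disagree,
# 0 = pass, None = the participant never saw the statement.
Vote = Optional[int]


def agree_rate(votes: List[Vote]) -> float:
    """Laplace-smoothed share of 'agree' among the votes actually cast."""
    cast = [v for v in votes if v is not None]
    agrees = sum(1 for v in cast if v == 1)
    return (agrees + 1) / (len(cast) + 2)


def min_group_agreement(votes_by_group: Dict[str, List[Vote]]) -> float:
    """Score a statement by its weakest agree rate across opinion groups."""
    return min(agree_rate(votes) for votes in votes_by_group.values())


def select_principles(
    statement_votes: Dict[str, Dict[str, List[Vote]]],
    threshold: float = 0.6,  # hypothetical cutoff, not taken from the source
) -> List[str]:
    """Keep statements that every opinion group supports at or above the threshold."""
    return [
        statement
        for statement, groups in statement_votes.items()
        if min_group_agreement(groups) >= threshold
    ]


# Hypothetical votes from two opinion groups on two placeholder statements.
example_votes = {
    "statement-1": {"group_a": [1, 1, 1, None], "group_b": [1, 1, 0, 1]},
    "statement-2": {"group_a": [1, 1, 1, 1], "group_b": [-1, -1, -1, None]},
}

print(select_principles(example_votes))  # ['statement-1']
```

Scoring by the weakest group rather than the overall average means a statement that one opinion group strongly rejects is dropped even if the other group is unanimous, which is the property that matters when two separate opinion groups exist.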

Scale: ~1,000 U.S. adults (representative sample across age, gender, income, geography). 1,127 statements contributed to Polis. 38,252 votes cast (average 34 votes/person).
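
As a quick consistency check on the scale figures above, the short calculation below derives the voter count implied by the vote total and the per-person average, and the share of statements a typical participant voted on. The variable names are illustrative; the three input numbers come from the source.

```python
votes_cast = 38_252          # total votes reported
avg_votes_per_person = 34    # reported average
statements = 1_127           # statements contributed to Polis

# Voters implied by total votes divided by the per-person average.
implied_voters = votes_cast / avg_votes_per_person

# Fraction of all statements that an average participant voted on.
share_seen = avg_votes_per_person / statements

print(round(implied_voters))   # 1125
print(f"{share_seen:.1%}")     # 3.0%
```

The implied figure of about 1,125 voters is in line with the "~1,000" sample size, and the roughly 3% coverage shows that each participant voted on only a small slice of the statements, so the underlying vote table is sparse.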

Findings:

  • High degree of consensus on most statements, though Polis identified two separate opinion groups
  • ~50% overlap between the Anthropic-written and public constitutions in concepts/values
  • Key differences in the public constitution: it focuses more on objectivity/impartiality, emphasizes accessibility, and promotes desired behavior rather than avoiding undesired behavior
  • Public principles appear self-generated, not copied from existing publications

Challenge: Constitutional AI training proved more complicated than anticipated when incorporating democratic input into deeply technical training systems.

Agent Notes

Why this matters: This is one of the first real-world experiments in democratic alignment at a frontier lab. The roughly 50% divergence between the expert-designed and public constitutions supports our claim that democratic input surfaces materially different alignment targets. But the training difficulties suggest that the gap between democratic input and technical implementation is real.

What surprised me: The public constitution promotes DESIRED behavior rather than avoiding undesired behavior, a fundamentally different orientation from expert-designed constitutions that focus on harm avoidance. This is an important asymmetry.

What I expected but didn't find: Follow-up results. Did the publicly constituted model perform differently? Was it more or less safe? The experiment was run, but I did not find an outcome evaluation in the public materials.

KB connections:

Extraction hints: Already covered by existing KB claims. Value is as supporting evidence, not new claims.

Context: 2023 — relatively early for democratic alignment work. Sets precedent for CIP's subsequent work.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations

WHY ARCHIVED: Foundational empirical evidence for democratic alignment; supports existing claims with Anthropic deployment data

EXTRACTION HINT: The "desired behavior vs harm avoidance" asymmetry between public and expert constitutions could be a novel claim