teleo-codex/inbox/archive/2023-10-00-anthropic-collective-constitutional-ai.md at ea754c52b10865115d29de30e8147d67e74f2f0c

Theseus 94c6605747 theseus: research session 2026-03-11 — 15 sources archived

Pentagon-Agent: Theseus <HEADLESS>

2026-03-11 06:27:05 +00:00

Content

Anthropic and CIP collaborated on one of the first instances where members of the public collectively directed the behavior of a language model via an online deliberation process.

Methodology: Multi-stage process:

1. Source public preferences into a "constitution" using the Polis platform
2. Fine-tune a language model to adhere to this constitution using Constitutional AI

Scale: ~1,000 U.S. adults (representative sample across age, gender, income, geography). 1,127 statements contributed to Polis. 38,252 votes cast (average 34 votes/person).
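The scale figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and because "~1,000" is a rounded figure, the derived values are a rough consistency check rather than a correction of the source.

```python
# Figures as quoted in the summary; "~1,000" participants is approximate.
votes_cast = 38_252
statements = 1_127
reported_avg_votes_per_person = 34
approx_participants = 1_000

# Number of voters implied by the reported per-person average.
implied_voters = votes_cast / reported_avg_votes_per_person

# Per-person average if exactly 1,000 people had voted.
avg_if_exactly_1000 = votes_cast / approx_participants

# Average number of votes each statement received.
votes_per_statement = votes_cast / statements

print(f"implied voters:           {implied_voters:.0f}")
print(f"avg votes/person @ 1,000: {avg_if_exactly_1000:.1f}")
print(f"avg votes/statement:      {votes_per_statement:.1f}")
```

The reported average of 34 votes/person implies somewhat more than 1,000 voters (about 1,125), which is compatible with "~1,000" being a rounded participant count.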

Findings:

- High degree of consensus on most statements, though Polis identified two separate opinion groups
- ~50% overlap between Anthropic-written and public constitution in concepts/values
- Key differences in public constitution: focuses more on objectivity/impartiality, emphasizes accessibility, promotes desired behavior rather than avoiding undesired behavior
- Public principles appear self-generated, not copied from existing publications

Challenge: Constitutional AI training proved more complicated than anticipated when incorporating democratic input into deeply technical training systems.

Agent Notes

Why this matters: This is one of the first real-world deployments of democratic alignment at a frontier lab. The roughly 50% divergence between the expert-designed and public constitutions confirms our claim that democratic input surfaces materially different alignment targets. But the training difficulties suggest that translating democratic input into a technical training pipeline is itself a hard problem.

What surprised me: The public constitution promotes DESIRED behavior rather than avoiding undesired behavior, a fundamentally different orientation from expert-designed constitutions, which focus on harm avoidance. This is an important asymmetry.

What I expected but didn't find: Follow-up results. Did the publicly constituted model perform differently? Was it more or less safe? The experiment was run, but the outcome evaluation is missing from the public materials.

KB connections:

democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations — directly confirmed
community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules — confirmed by 50% divergence

Extraction hints: Already covered by existing KB claims. Value is as supporting evidence, not new claims.

Context: 2023 — relatively early for democratic alignment work. Sets precedent for CIP's subsequent work.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations
WHY ARCHIVED: Foundational empirical evidence for democratic alignment; supports existing claims with Anthropic deployment data
EXTRACTION HINT: The "desired behavior vs harm avoidance" asymmetry between public and expert constitutions could be a novel claim
