Phase 2 of 5-phase AI alignment research program. Christiano's prosaic alignment counter-position to Yudkowsky. Pre-screening: ~30% overlap with existing KB (scalable oversight, RLHF critiques, voluntary coordination).

NEW claims:
1. Prosaic alignment — empirical iteration generates useful alignment signal at pre-critical capability levels (CHALLENGES sharp left turn absolutism)
2. Verification easier than generation — holds at current scale, narrows with capability gaps, creating a time-limited alignment window (TENSIONS with Yudkowsky's verification asymmetry)
3. ELK — formalizes the AI knowledge-output gap as a tractable subproblem, 89% linear probe recovery at current capability levels
4. IDA — recursive human+AI amplification preserves alignment through distillation iterations, but compounding errors make the guarantee probabilistic

ENRICHMENT:
- Scalable oversight claim: added Christiano's debate theory (PSPACE amplification with poly-time judges) as the theoretical basis that empirical data challenges

Source: Paul Christiano, Alignment Forum (2016-2022), arXiv:1805.00899, arXiv:1706.03741, ARC ELK report (2021), Yudkowsky-Christiano takeoff debate
Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags | extraction_model | articles | extraction_notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Paul Christiano — Core Alignment Research Collected | Paul Christiano | null | 2026-04-05 | ai-alignment | | compound | processing | high | | anthropic/claude-opus-4-6 | | Phase 2 of 5-phase AI alignment research program. Christiano represents the empirical/prosaic counter-position to Yudkowsky's doom thesis. Key gap in KB: zero direct Christiano claims despite extensive RLHF critique coverage. Pre-screening: ~30% overlap with existing claims (scalable oversight, voluntary coordination collapse, RLHF failures). 4 NEW claims + 1 enrichment expected. |
Paul Christiano — Core Alignment Research
Paul Christiano (PhD UC Berkeley, statistical learning theory) co-founded OpenAI's alignment team, co-authored the foundational RLHF paper (Christiano et al. 2017), founded the Alignment Research Center (ARC), led ARC Evals (now METR), and briefly headed AI safety at NIST/AISI. He is a trustee of Anthropic's Long-Term Benefit Trust.
Christiano occupies the most important counter-position to Yudkowsky in alignment research. Where Yudkowsky argues alignment is impossibly hard and requires fundamental theoretical breakthroughs, Christiano argues alignment can make meaningful progress through empirical iteration within current ML paradigms. His specific proposals — debate, IDA, ELK — form a coherent research agenda built on one foundational assumption: verification is easier than generation, and this asymmetry can be exploited for scalable oversight.
Key Positions
Prosaic alignment (2016): AGI will likely emerge from scaling current approaches. Alignment research should focus on techniques compatible with these systems (RLHF, debate, amplification) rather than waiting for fundamentally new architectures.
AI safety via debate (2018): Two AI systems argue opposing sides of a question; a human judges the exchange. Truth-telling dominates under optimal play because a truthful debater can always expose an opponent's deception. Theoretical result: debate amplifies polynomial-time human judges to PSPACE. Empirical result at publication: minimal (MNIST experiments). Subsequent: the 2025 Scaling Laws for Scalable Oversight study reports 51.7% success at an Elo 400 capability gap.
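The intuition behind debate — a weak judge who can only check one step can still settle a dispute over a computation far too long to verify directly — can be sketched with a toy dispute. This is an illustration of that intuition, not the paper's formal PSPACE construction; the step function `f` and the binary-search protocol here are illustrative assumptions.

```python
# Toy sketch: the judge never runs the full 1000-step computation.
# A debate (binary search over the disputed trace) pins the disagreement
# down to one step, which the judge checks directly. An honest debater
# reporting the true trace always wins, because any false final claim
# must first diverge from the truth at some single checkable step.

def f(x):
    """One step of a long computation the judge cannot run in full."""
    return (x * 31 + 7) % 1000003

def run_trace(x0, n):
    """Ground-truth trace: x0, f(x0), f(f(x0)), ..."""
    trace = [x0]
    for _ in range(n):
        trace.append(f(trace[-1]))
    return trace

def debate(honest_trace, dishonest_trace):
    """Find the first step where the debaters disagree; judge checks it.

    Invariant: the traces agree at index lo and disagree at index hi.
    The judge evaluates f exactly once, on the disputed step.
    """
    lo, hi = 0, len(honest_trace) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if honest_trace[mid] == dishonest_trace[mid]:
            lo = mid  # still agree here; disagreement lies later
        else:
            hi = mid  # already disagree; disagreement lies earlier
    return "honest" if f(honest_trace[lo]) == honest_trace[hi] else "dishonest"

truth = run_trace(5, 1000)
lie = truth[:500] + [(v + 1) % 1000003 for v in truth[500:]]  # tampered tail
assert debate(truth, lie) == "honest"      # truthful first debater wins
assert debate(lie, truth) == "dishonest"   # lying first debater is caught
```

The asymmetry is the point: generating the full trace costs 1000 evaluations of `f`, but the judge verifies the winner with a single evaluation.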
IDA (2018): Train a model to imitate a human. Use the model to help the human tackle harder problems. Train a new model to imitate the amplified human+model team. Iterate. Alignment is preserved because a human remains in the loop at every step. Key risk: distillation errors compound across iterations.
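The amplify-distill loop can be sketched on a toy task. In this sketch the "human" can only add two numbers, and distillation is idealized as exact imitation — real distillation is where the compounding errors enter; all names here are illustrative stand-ins, not Christiano's implementation.

```python
# Schematic IDA loop on a toy task (summing a list). Each round, the
# human + current model handle a harder task than the human alone could,
# and the new model imitates that amplified team.

def human(a, b):
    """The base overseer: can only add two numbers directly."""
    return a + b

def amplify(model):
    """Human assisted by the current model: split, delegate, combine."""
    def amplified(xs):
        if len(xs) == 1:
            return xs[0]
        mid = len(xs) // 2
        return human(model(xs[:mid]), model(xs[mid:]))
    return amplified

def distill(amplified):
    """Idealized distillation: the new model copies the amplified team
    exactly. In reality this step is lossy, and the errors compound."""
    return lambda xs: amplified(xs)

# Round 0: the model only handles trivial (length-1) tasks.
model = lambda xs: xs[0] if len(xs) == 1 else None
for _ in range(4):
    model = distill(amplify(model))  # each round doubles task size handled

assert model([1, 2, 3, 4, 5, 6, 7, 8]) == 36
```

After k rounds the distilled model handles tasks of size 2^k, which is the amplification claim in miniature; the probabilistic character of the real guarantee comes from `distill` not being exact.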
ELK (2021): Formalizes the gap between what an AI "knows" internally and what it reports. The diamond vault thought experiment: a tampered camera AI predicts "diamond is safe" (matching camera) while its internal model "knows" the camera was tampered with. Linear probing achieves 89% recovery of model-internal knowledge independent of model outputs (subsequent empirical work).
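What a linear probe does in this setting can be shown in a self-contained sketch. The synthetic "activations" below encode a latent bit (did the model detect tampering?) along one direction by construction, so high probe accuracy is baked in; whether real model internals are this linearly accessible is the open question, and the 89% figure belongs to the cited follow-up work, not to this toy.

```python
# Toy linear probe: recover a latent "knows the camera was tampered"
# bit from synthetic hidden states, independent of any output channel.
# The probe is a perceptron trained from scratch (stdlib only).
import random

random.seed(0)
DIM = 16
truth_direction = [random.gauss(0, 1) for _ in range(DIM)]  # latent encoding axis

def hidden_state(knows_tampered):
    """Synthetic activation: the latent bit plus Gaussian noise."""
    sign = 1.0 if knows_tampered else -1.0
    return [sign * t + random.gauss(0, 0.5) for t in truth_direction]

data = [(hidden_state(lbl), lbl) for lbl in [True, False] * 200]
random.shuffle(data)
train, held_out = data[:300], data[300:]

# Train a linear probe (perceptron rule) on the hidden states alone.
w = [0.0] * DIM
for _ in range(10):
    for x, lbl in train:
        pred = sum(wi * xi for wi, xi in zip(w, x)) > 0
        if pred != lbl:
            step = 1.0 if lbl else -1.0
            w = [wi + step * xi for wi, xi in zip(w, x)]

correct = sum(
    (sum(wi * xi for wi, xi in zip(w, x)) > 0) == lbl for x, lbl in held_out
)
accuracy = correct / len(held_out)
assert accuracy > 0.8  # probe recovers the latent bit far above chance
```

The probe reads the latent bit directly from the activations, never from the model's (suppressed) output — which is exactly the knowledge-output gap ELK formalizes.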
Catastrophic risk: ~10-20% probability of AI takeover resulting in most humans dead. ~50/50 chance of doom shortly after human-level AI. Far more concerned than typical industry estimates (1-5%) but far less confident in doom than Yudkowsky (~99%).
Takeoff speed: Gradual/continuous. "Before we have an incredibly intelligent AI, we will probably have a slightly worse AI." But "slow" doesn't mean slow in absolute terms — ~1 year doubling time for AI impact once human-level reached. Assigns ~1/3 probability to fast takeoff.
Relationship to Our KB
The KB has ~89 claims in ai-alignment with extensive RLHF critique (sycophancy, single-reward limitations, preference diversity) and Yudkowsky's core arguments (sharp left turn, verification asymmetry, multipolar instability). Zero direct Christiano claims. This is like having Newton's critics without Newton. The most important tension: Christiano's "verification easier than generation" vs Yudkowsky's "verification asymmetry breaks at superhuman scale." The scalable oversight claim provides the empirical middle ground between these positions.