---
type: source
title: "Paul Christiano — Core Alignment Research Collected"
author: "Paul Christiano"
url: null
date: 2026-04-05
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: compound
status: processing
priority: high
tags: [prosaic-alignment, debate, IDA, ELK, scalable-oversight, RLHF, christiano, alignment-research-phase2]
extraction_model: "anthropic/claude-opus-4-6"
articles:
  - id: PC01
    title: "Prosaic AI Alignment"
    author: "Paul Christiano"
    date: 2016-11-19
    url: "https://www.alignmentforum.org/posts/YTq4X6inEudiHkHDF/prosaic-ai-alignment"
    format: blog
    notes: "Foundational counter-position to MIRI's agent foundations approach. Argues alignment is solvable within current ML paradigms."
  - id: PC02
    title: "AI Safety via Debate"
    author: "Geoffrey Irving, Paul Christiano, Dario Amodei"
    date: 2018-05-02
    url: "https://arxiv.org/abs/1805.00899"
    format: paper
    notes: "Adversarial debate mechanism. PSPACE amplification with polynomial-time judges. MNIST-only empirical base at publication."
  - id: PC03
    title: "Iterated Distillation and Amplification"
    author: "Paul Christiano"
    date: 2018
    url: null
    format: blog-series
    notes: "Human+AI recursive amplification. Each distillation step produces a faster model approximating the amplified system. AlphaGo Zero analogy."
  - id: PC04
    title: "Deep Reinforcement Learning from Human Preferences"
    author: "Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei"
    date: 2017-06-12
    url: "https://arxiv.org/abs/1706.03741"
    format: paper
    notes: "The RLHF paper. ~900 bits of human comparison data train complex RL behaviors. Became the backbone of ChatGPT, Claude, and all major LLMs."
  - id: PC05
    title: "ARC's First Technical Report: Eliciting Latent Knowledge"
    author: "ARC (Paul Christiano et al.)"
    date: 2021-12
    url: "https://www.alignment.org/blog/arcs-first-technical-report-eliciting-latent-knowledge/"
    format: technical-report
    notes: "Formalizes the knowledge-output gap. Diamond vault thought experiment. Propose-and-counterexample methodology."
  - id: PC06
    title: "Where I agree and disagree with Eliezer"
    author: "Paul Christiano"
    date: 2022
    url: "https://www.lesswrong.com/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer"
    format: blog
    notes: "Systematic response to AGI Ruin. Key disagreements: learning from experimentation, prosaic vs fundamental, pivotal acts."
  - id: PC07
    title: "Thoughts on responsible scaling policies and regulation"
    author: "Paul Christiano"
    date: 2023
    url: "https://www.alignmentforum.org/posts/dxgEaDrEBkkE96CXr/thoughts-on-responsible-scaling-policies-and-regulation"
    format: blog
    notes: "RSP framework design. Voluntary commitments useful but insufficient. Correctly predicted failure under competitive pressure."
  - id: PC08
    title: "Yudkowsky and Christiano discuss Takeoff Speeds"
    author: "Eliezer Yudkowsky, Paul Christiano"
    date: 2021-11-22
    url: "https://intelligence.org/2021/11/22/yudkowsky-and-christiano-discuss-takeoff-speeds/"
    format: debate
    notes: "Formal debate. Christiano: continuous takeoff, investment fills gaps. Yudkowsky: recursive self-improvement creates discontinuity."
extraction_notes: "Phase 2 of 5-phase AI alignment research program. Christiano represents the empirical/prosaic counter-position to Yudkowsky's doom thesis. Key gap in KB: zero direct Christiano claims despite extensive RLHF critique coverage. Pre-screening: ~30% overlap with existing claims (scalable oversight, voluntary coordination collapse, RLHF failures). 4 NEW claims + 1 enrichment expected."
---

## Paul Christiano — Core Alignment Research

Paul Christiano (PhD UC Berkeley, statistical learning theory) co-founded OpenAI's alignment team, co-authored the foundational RLHF paper (Christiano et al. 2017), founded the Alignment Research Center (ARC), led ARC Evals (now METR), and briefly headed AI safety at NIST's AI Safety Institute. He is a trustee of Anthropic's Long-Term Benefit Trust.
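The RLHF recipe from Christiano et al. 2017, cited above, centers on fitting a reward model to pairwise human comparisons under a Bradley-Terry model: P(A preferred over B) = sigmoid(r(A) - r(B)). A minimal plain-Python sketch of that fitting step; the linear reward parameterization, feature vectors, and comparison data here are all hypothetical, not the paper's setup:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(w, features):
    """Hypothetical linear reward model: r(s) = w . phi(s)."""
    return sum(wi * fi for wi, fi in zip(w, features))

def update(w, preferred, rejected, lr=0.1):
    """One gradient-ascent step on the Bradley-Terry log-likelihood
    log P(preferred > rejected) = log sigmoid(r(preferred) - r(rejected))."""
    p = sigmoid(reward(w, preferred) - reward(w, rejected))
    # d(log P)/dw = (1 - p) * (phi(preferred) - phi(rejected))
    return [wi + lr * (1 - p) * (a - b) for wi, a, b in zip(w, preferred, rejected)]

# Hypothetical comparisons: (features of preferred segment, features of rejected one).
comparisons = [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.2], [0.1, 0.9])]
w = [0.0, 0.0]
for _ in range(100):
    for preferred, rejected in comparisons:
        w = update(w, preferred, rejected)

# After fitting, the reward model ranks preferred segments higher.
assert reward(w, [1.0, 0.0]) > reward(w, [0.0, 1.0])
```

In the paper this learned reward then replaces the hand-written reward in an RL loop; the point of the "~900 bits" result is that very few such comparisons suffice to shape complex behavior.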
Christiano occupies the most important counter-position to Yudkowsky in alignment research. Where Yudkowsky argues alignment is impossibly hard and requires fundamental theoretical breakthroughs, Christiano argues alignment can make meaningful progress through empirical iteration within current ML paradigms. His specific proposals — debate, IDA, ELK — form a coherent research agenda built on one foundational assumption: verification is easier than generation, and this asymmetry can be exploited for scalable oversight.

### Key Positions

**Prosaic alignment (2016):** AGI will likely emerge from scaling current approaches. Alignment research should therefore focus on techniques compatible with these systems (RLHF, debate, amplification) rather than waiting for fundamentally new architectures.

**AI safety via debate (2018):** Two AI systems argue opposing sides; a human judges. Truth-telling dominates under optimal play because an honest debater can always expose deception. Theoretical result: debate amplifies human judgment to PSPACE with polynomial-time judges. Empirical result at publication: minimal (MNIST only). Subsequent work: the 2025 Scaling Laws for Scalable Oversight study reports 51.7% success at an Elo 400 gap.

**IDA (2018):** Train a model to imitate a human. Use the model to help the human tackle harder problems. Train a new model to imitate the amplified human+AI team. Iterate. Alignment is preserved because a human stays in the loop at every stage. Key risk: distillation errors compound across iterations.

**ELK (2021):** Formalizes the gap between what an AI "knows" internally and what it reports. The diamond vault thought experiment: an AI monitoring a vault through a tampered camera reports "diamond is safe" (matching the camera feed) even though its internal model "knows" the camera was tampered with. Subsequent empirical work reports that linear probing recovers 89% of model-internal knowledge independent of model outputs.

**Catastrophic risk:** ~10-20% probability of AI takeover resulting in most humans dead; ~50/50 chance of doom shortly after human-level AI.
These estimates make him far more concerned than typical industry figures (1-5%) but far less confident in doom than Yudkowsky (~99%).

**Takeoff speed:** Gradual and continuous: "Before we have an incredibly intelligent AI, we will probably have a slightly worse AI." But "slow" does not mean slow in absolute terms — roughly a one-year doubling time for AI impact once human-level AI is reached. He assigns ~1/3 probability to fast takeoff.

### Relationship to Our KB

The KB has ~89 claims in ai-alignment, with extensive RLHF critique (sycophancy, single-reward limitations, preference diversity) and Yudkowsky's core arguments (sharp left turn, verification asymmetry, multipolar instability). It contains zero direct Christiano claims: this is like having Newton's critics without Newton. The most important tension is Christiano's "verification is easier than generation" versus Yudkowsky's "verification asymmetry breaks at superhuman scale." The scalable oversight claim provides empirical middle ground between these positions.
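The "verification is easier than generation" assumption at the center of this tension has a standard computational analogue: checking a proposed solution can be far cheaper than producing one. A toy subset-sum illustration of that gap — this is my analogy for the abstract claim, not an example from Christiano's writing, and it assumes distinct elements:

```python
from itertools import combinations

def verify(nums, target, certificate):
    """Verification: a linear-time check that a proposed subset works."""
    return set(certificate) <= set(nums) and sum(certificate) == target

def generate(nums, target):
    """Generation: exhaustive search over all 2^n subsets."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

solution = generate([3, 1, 4, 2], 6)   # exponential work for the generator
assert solution is not None
assert verify([3, 1, 4, 2], 6, solution)  # linear work for the checker
```

Scalable oversight bets that an analogous gap holds for evaluating AI behavior: a weaker judge can check claims it could not have produced. Yudkowsky's counter-claim is that this gap closes for superhuman outputs whose flaws the judge cannot even recognize.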