teleo-codex/inbox/archive/ai-alignment/christiano-core-alignment-research-collected.md
m3taversal 08dea4249f theseus: extract 4 NEW claims + 1 enrichment from Christiano core alignment research
Phase 2 of 5-phase AI alignment research program. Christiano's prosaic
alignment counter-position to Yudkowsky. Pre-screening: ~30% overlap with
existing KB (scalable oversight, RLHF critiques, voluntary coordination).

NEW claims:
1. Prosaic alignment — empirical iteration generates useful alignment signal at
   pre-critical capability levels (CHALLENGES sharp left turn absolutism)
2. Verification easier than generation — holds at current scale, narrows with
   capability gaps, creating time-limited alignment window (TENSIONS with
   Yudkowsky's verification asymmetry)
3. ELK — formalizes AI knowledge-output gap as tractable subproblem, 89%
   linear probe recovery at current capability levels
4. IDA — recursive human+AI amplification preserves alignment through
   distillation iterations, but compounding errors make the guarantee
   probabilistic

ENRICHMENT:
- Scalable oversight claim: added Christiano's debate theory (PSPACE
  amplification with poly-time judges) as theoretical basis that empirical
  data challenges

Source: Paul Christiano, Alignment Forum (2016-2022), arXiv:1805.00899,
arXiv:1706.03741, ARC ELK report (2021), Yudkowsky-Christiano takeoff debate

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
2026-04-05 20:16:59 +01:00

6.9 KiB

type: source
title: Paul Christiano — Core Alignment Research Collected
author: Paul Christiano
url: null
date: 2026-04-05
domain: ai-alignment
secondary_domains: collective-intelligence
format: compound
status: processing
priority: high
tags: prosaic-alignment, debate, IDA, ELK, scalable-oversight, RLHF, christiano, alignment-research-phase2
extraction_model: anthropic/claude-opus-4-6

articles:

| id | title | author | date | url | format | notes |
| --- | --- | --- | --- | --- | --- | --- |
| PC01 | Prosaic AI Alignment | Paul Christiano | 2016-11-19 | https://www.alignmentforum.org/posts/YTq4X6inEudiHkHDF/prosaic-ai-alignment | blog | Foundational counter-position to MIRI's agent foundations approach. Argues alignment is solvable within current ML paradigms. |
| PC02 | AI Safety via Debate | Geoffrey Irving, Paul Christiano, Dario Amodei | 2018-05-02 | https://arxiv.org/abs/1805.00899 | paper | Adversarial debate mechanism. PSPACE amplification with polynomial-time judges. MNIST-only empirical base at publication. |
| PC03 | Iterated Distillation and Amplification | Paul Christiano | 2018 | null | blog-series | Human+AI recursive amplification. Each distillation step produces a faster model approximating the amplified system. AlphaGo Zero analogy. |
| PC04 | Deep Reinforcement Learning from Human Preferences | Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei | 2017-06-12 | https://arxiv.org/abs/1706.03741 | paper | The RLHF paper. 900 bits of human comparison data train complex RL behaviors. Became the backbone of ChatGPT, Claude, and all major LLMs. |
| PC05 | ARC's First Technical Report: Eliciting Latent Knowledge | ARC (Paul Christiano et al.) | 2021-12 | https://www.alignment.org/blog/arcs-first-technical-report-eliciting-latent-knowledge/ | technical-report | Formalizes the knowledge-output gap. Diamond vault thought experiment. Propose-and-counterexample methodology. |
| PC06 | Where I agree and disagree with Eliezer | Paul Christiano | 2022 | https://www.lesswrong.com/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer | blog | Systematic response to AGI Ruin. Key disagreements: learning from experimentation, prosaic vs. fundamental research, pivotal acts. |
| PC07 | Thoughts on responsible scaling policies and regulation | Paul Christiano | 2023 | https://www.alignmentforum.org/posts/dxgEaDrEBkkE96CXr/thoughts-on-responsible-scaling-policies-and-regulation | blog | RSP framework design. Voluntary commitments useful but insufficient. Correctly predicted failure under competitive pressure. |
| PC08 | Yudkowsky and Christiano discuss Takeoff Speeds | Eliezer Yudkowsky, Paul Christiano | 2021-11-22 | https://intelligence.org/2021/11/22/yudkowsky-and-christiano-discuss-takeoff-speeds/ | debate | Formal debate. Christiano: continuous takeoff, investment fills gaps. Yudkowsky: recursive self-improvement creates discontinuity. |

extraction_notes: Phase 2 of 5-phase AI alignment research program. Christiano represents the empirical/prosaic counter-position to Yudkowsky's doom thesis. Key gap in KB: zero direct Christiano claims despite extensive RLHF critique coverage. Pre-screening: ~30% overlap with existing claims (scalable oversight, voluntary coordination collapse, RLHF failures). 4 NEW claims + 1 enrichment expected.

Paul Christiano — Core Alignment Research

Paul Christiano (PhD UC Berkeley, statistical learning theory) co-founded OpenAI's alignment team, co-authored the foundational RLHF paper (Christiano et al. 2017), founded the Alignment Research Center (ARC), led ARC Evals (now METR), and briefly headed AI safety at NIST/AISI. He is one of Anthropic's Long-Term Benefit Trust trustees.

Christiano occupies the most important counter-position to Yudkowsky in alignment research. Where Yudkowsky argues alignment is impossibly hard and requires fundamental theoretical breakthroughs, Christiano argues alignment can make meaningful progress through empirical iteration within current ML paradigms. His specific proposals — debate, IDA, ELK — form a coherent research agenda built on one foundational assumption: verification is easier than generation, and this asymmetry can be exploited for scalable oversight.

Key Positions

Prosaic alignment (2016): AGI will likely emerge from scaling current approaches. Alignment research should focus on techniques compatible with these systems (RLHF, debate, amplification) rather than waiting for fundamentally new architectures.

AI safety via debate (2018): Two AI systems debate; a human judges. Truth-telling dominates under optimal play because a truthful debater can always expose deception. Theoretical result: debate amplifies human judgment to PSPACE with polynomial-time judges. Empirical result at publication: minimal (MNIST only). Subsequent: the 2025 Scaling Laws for Scalable Oversight study shows 51.7% success at an Elo 400 gap.
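The core intuition — a cheap judge can adjudicate a claim about an expensive computation because the honest side can always steer the dispute to a checkable fact — can be sketched as a toy bisection game. This is an illustration of the idea, not the training setup from arXiv:1805.00899; the defender and challenger strategies below are hypothetical simulations (the lying defender is assumed to push its error into the right half each round).

```python
def debate(xs, claimed_sum):
    """Toy bisection debate over the claim `sum(xs) == claimed_sum`.

    Each round the defender commits to two half-sums that add up to its
    current claim, and the honest challenger recurses into the false one.
    The judge's only direct check is a single element at the end, so its
    work is O(log n) even though the claim concerns the whole array.
    """
    lo, hi, claim = 0, len(xs), claimed_sum
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Simulated defender: reports the left half truthfully, so any
        # lie in `claim` is forced into the right-half sub-claim.
        left_claim = sum(xs[lo:mid])
        right_claim = claim - left_claim
        # Simulated honest challenger: the left sub-claim is true by
        # construction, so any remaining falsehood sits on the right.
        lo, claim = mid, right_claim
    return xs[lo] == claim  # the judge verifies one element directly
```

A truthful defender survives every round, while any false total is eventually pinned to a single element the judge can check — the sense in which verification stays cheap while generation is hard.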

IDA (2018): Train a model to imitate a human. Use the model to help the human tackle harder problems. Train a new model to imitate the amplified human+model team. Iterate. Alignment is preserved because the human stays in the loop at every step. Key risk: distillation errors compound across iterations.
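The amplify-then-distill loop can be sketched minimally with stand-ins for the real components: `human` and the initial `model` are plain functions, `distill` is a lookup table in place of supervised training, and `decompose` is a hypothetical one-step question splitter (real IDA would use the model itself to decompose).

```python
from typing import Callable

Answerer = Callable[[str], str]  # maps a question to an answer

def decompose(question: str) -> list[str]:
    # Hypothetical decomposition; real IDA would use the model itself.
    return [f"subquestion of: {question}"]

def amplify(human: Answerer, model: Answerer) -> Answerer:
    """Amplification: the human answers with the model's help on subquestions."""
    def amplified(question: str) -> str:
        hints = "; ".join(model(q) for q in decompose(question))
        return human(f"{question} [hints: {hints}]")
    return amplified

def distill(teacher: Answerer, questions: list[str]) -> Answerer:
    """Distillation: a fast model imitates the slow amplified system.
    A lookup table stands in for gradient-based training here."""
    table = {q: teacher(q) for q in questions}
    return lambda q: table.get(q, "unknown")

def ida(human: Answerer, model: Answerer,
        questions: list[str], rounds: int) -> Answerer:
    """Iterate amplify-then-distill; the human stays in the loop each round."""
    for _ in range(rounds):
        model = distill(amplify(human, model), questions)
    return model
```

Even in this toy, the compounding-error risk is visible: each `distill` only approximates its teacher (here, questions outside the table fall back to "unknown"), and those approximation gaps feed into the next round's amplification.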

ELK (2021): Formalizes the gap between what an AI "knows" internally and what it reports. The diamond vault thought experiment: an AI guarding a vault reports "diamond is safe" because that matches the (tampered) camera feed, even though its internal model "knows" the camera was tampered with. Linear probing achieves 89% recovery of model-internal knowledge independent of model outputs (subsequent empirical work).
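What "linear probe recovery" means can be illustrated with a toy sketch (not ARC's methodology, and unrelated to the 89% figure): synthetic "activations" encode a latent bit (e.g. "camera tampered") along one noisy direction that the notional output head ignores, and a logistic-regression probe recovers that bit from the activations alone.

```python
import math
import random

def make_activations(n, dim=8, seed=0):
    """Synthetic stand-in for hidden states: a latent bit is linearly
    encoded along dimension 0 (plus Gaussian noise elsewhere)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        latent = rng.randint(0, 1)
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[0] += 2.0 if latent else -2.0  # direction carrying the knowledge
        data.append((h, latent))
    return data

def train_probe(data, dim=8, epochs=50, lr=0.1):
    """Linear probe: logistic regression fit by batch gradient descent."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for h, y in data:
            p = 1.0 / (1.0 + math.exp(-sum(wi * hi for wi, hi in zip(w, h))))
            for i in range(dim):
                grad[i] += (p - y) * h[i]
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

def accuracy(w, data):
    """Fraction of examples where the probe's sign recovers the latent bit."""
    hits = sum((sum(wi * hi for wi, hi in zip(w, h)) > 0) == (y == 1)
               for h, y in data)
    return hits / len(data)
```

The point of the exercise: the probe reads the latent bit straight off the activations, regardless of what the model's output channel says — the kind of knowledge-output gap ELK asks how to close reliably.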

Catastrophic risk: ~10-20% probability of AI takeover resulting in most humans dead. ~50/50 chance of doom shortly after human-level AI. Far more concerned than typical industry estimates (1-5%) but far less confident in doom than Yudkowsky (~99%).

Takeoff speed: Gradual/continuous. "Before we have an incredibly intelligent AI, we will probably have a slightly worse AI." But "slow" doesn't mean slow in absolute terms — ~1 year doubling time for AI impact once human-level reached. Assigns ~1/3 probability to fast takeoff.

Relationship to Our KB

The KB has ~89 claims in ai-alignment with extensive RLHF critique (sycophancy, single-reward limitations, preference diversity) and Yudkowsky's core arguments (sharp left turn, verification asymmetry, multipolar instability). Zero direct Christiano claims. This is like having Newton's critics without Newton. The most important tension: Christiano's "verification easier than generation" vs Yudkowsky's "verification asymmetry breaks at superhuman scale." The scalable oversight claim provides the empirical middle ground between these positions.