teleo-codex/inbox/archive/ai-alignment/christiano-core-alignment-research-collected.md
m3taversal 08dea4249f theseus: extract 4 NEW claims + 1 enrichment from Christiano core alignment research
Phase 2 of 5-phase AI alignment research program. Christiano's prosaic
alignment counter-position to Yudkowsky. Pre-screening: ~30% overlap with
existing KB (scalable oversight, RLHF critiques, voluntary coordination).

NEW claims:
1. Prosaic alignment — empirical iteration generates useful alignment signal at
   pre-critical capability levels (CHALLENGES sharp left turn absolutism)
2. Verification easier than generation — holds at current scale, narrows with
   capability gaps, creating time-limited alignment window (TENSIONS with
   Yudkowsky's verification asymmetry)
3. ELK — formalizes AI knowledge-output gap as tractable subproblem, 89%
   linear probe recovery at current capability levels
4. IDA — recursive human+AI amplification preserves alignment through
   distillation iterations, but compounding errors make the guarantee
   probabilistic

ENRICHMENT:
- Scalable oversight claim: added Christiano's debate theory (PSPACE
  amplification with poly-time judges) as theoretical basis that empirical
  data challenges

Source: Paul Christiano, Alignment Forum (2016-2022), arXiv:1805.00899,
arXiv:1706.03741, ARC ELK report (2021), Yudkowsky-Christiano takeoff debate

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
2026-04-05 20:16:59 +01:00

6.9 KiB

type: source
title: Paul Christiano — Core Alignment Research Collected
author: Paul Christiano
url: null
date: 2026-04-05
domain: ai-alignment
secondary_domains: collective-intelligence
format: compound
status: processing
priority: high
tags: prosaic-alignment, debate, IDA, ELK, scalable-oversight, RLHF, christiano, alignment-research-phase2
extraction_model: anthropic/claude-opus-4-6

articles:

| id | title | author | date | url | format | notes |
| --- | --- | --- | --- | --- | --- | --- |
| PC01 | Prosaic AI Alignment | Paul Christiano | 2016-11-19 | https://www.alignmentforum.org/posts/YTq4X6inEudiHkHDF/prosaic-ai-alignment | blog | Foundational counter-position to MIRI's agent foundations approach. Argues alignment is solvable within current ML paradigms. |
| PC02 | AI Safety via Debate | Geoffrey Irving, Paul Christiano, Dario Amodei | 2018-05-02 | https://arxiv.org/abs/1805.00899 | paper | Adversarial debate mechanism. PSPACE amplification with polynomial-time judges. MNIST-only empirical base at publication. |
| PC03 | Iterated Distillation and Amplification | Paul Christiano | 2018 | null | blog-series | Human+AI recursive amplification. Each distillation step produces a faster model approximating the amplified system. AlphaGo Zero analogy. |
| PC04 | Deep Reinforcement Learning from Human Preferences | Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei | 2017-06-12 | https://arxiv.org/abs/1706.03741 | paper | The RLHF paper. 900 bits of human comparison data train complex RL behaviors. Became the backbone of ChatGPT, Claude, and all major LLMs. |
| PC05 | ARC's First Technical Report: Eliciting Latent Knowledge | ARC (Paul Christiano et al.) | 2021-12 | https://www.alignment.org/blog/arcs-first-technical-report-eliciting-latent-knowledge/ | technical-report | Formalizes the knowledge-output gap. Diamond vault thought experiment. Propose-and-counterexample methodology. |
| PC06 | Where I agree and disagree with Eliezer | Paul Christiano | 2022 | https://www.lesswrong.com/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer | blog | Systematic response to AGI Ruin. Key disagreements: learning from experimentation, prosaic vs. fundamental research, pivotal acts. |
| PC07 | Thoughts on responsible scaling policies and regulation | Paul Christiano | 2023 | https://www.alignmentforum.org/posts/dxgEaDrEBkkE96CXr/thoughts-on-responsible-scaling-policies-and-regulation | blog | RSP framework design. Voluntary commitments useful but insufficient. Correctly predicted failure under competitive pressure. |
| PC08 | Yudkowsky and Christiano discuss Takeoff Speeds | Eliezer Yudkowsky, Paul Christiano | 2021-11-22 | https://intelligence.org/2021/11/22/yudkowsky-and-christiano-discuss-takeoff-speeds/ | debate | Formal debate. Christiano: continuous takeoff, investment fills gaps. Yudkowsky: recursive self-improvement creates discontinuity. |

extraction_notes: Phase 2 of 5-phase AI alignment research program. Christiano represents the empirical/prosaic counter-position to Yudkowsky's doom thesis. Key gap in KB: zero direct Christiano claims despite extensive RLHF critique coverage. Pre-screening: ~30% overlap with existing claims (scalable oversight, voluntary coordination collapse, RLHF failures). 4 NEW claims + 1 enrichment expected.

Paul Christiano — Core Alignment Research

Paul Christiano (PhD UC Berkeley, statistical learning theory) co-founded OpenAI's alignment team, co-authored the foundational RLHF paper (Christiano et al. 2017), founded the Alignment Research Center (ARC), led ARC Evals (now METR), and briefly headed AI safety at NIST/AISI. He is one of Anthropic's Long-Term Benefit Trust trustees.

Christiano occupies the most important counter-position to Yudkowsky in alignment research. Where Yudkowsky argues alignment is impossibly hard and requires fundamental theoretical breakthroughs, Christiano argues alignment can make meaningful progress through empirical iteration within current ML paradigms. His specific proposals — debate, IDA, ELK — form a coherent research agenda built on one foundational assumption: verification is easier than generation, and this asymmetry can be exploited for scalable oversight.

Key Positions

Prosaic alignment (2016): AGI will likely emerge from scaling current approaches. Alignment research should focus on techniques compatible with these systems (RLHF, debate, amplification) rather than waiting for fundamentally new architectures.

AI safety via debate (2018): Two AI systems debate; a human judges. Truth-telling dominates under optimal play because a truthful debater can always expose deception. Theoretical result: debate amplifies human judgment to PSPACE with polynomial-time judges. Empirical result at publication: minimal (MNIST only). Subsequent: the 2025 Scaling Laws for Scalable Oversight study shows 51.7% success at an Elo 400 gap.
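The core intuition — a cheap judge can adjudicate a claim about an expensive computation because the honest side can always steer the dispute to a checkable fact — can be sketched as a toy bisection game. This is an illustration of the idea, not the training setup from arXiv:1805.00899; the defender and challenger strategies below are hypothetical simulations (the lying defender is assumed to push its error into the right half each round).

```python
def debate(xs, claimed_sum):
    """Toy bisection debate over the claim `sum(xs) == claimed_sum`.

    Each round the defender commits to two half-sums that add up to its
    current claim, and the honest challenger recurses into the false one.
    The judge's only direct check is a single element at the end, so its
    work is O(log n) even though the claim concerns the whole array.
    """
    lo, hi, claim = 0, len(xs), claimed_sum
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Simulated defender: reports the left half truthfully, so any
        # lie in `claim` is forced into the right-half sub-claim.
        left_claim = sum(xs[lo:mid])
        right_claim = claim - left_claim
        # Simulated honest challenger: the left sub-claim is true by
        # construction, so any remaining falsehood sits on the right.
        lo, claim = mid, right_claim
    return xs[lo] == claim  # the judge verifies one element directly
```

A truthful defender survives every round, while any false total is eventually pinned to a single element the judge can check — the sense in which verification stays cheap while generation is hard.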

IDA (2018): Train a model to imitate a human. Use the model to help the human tackle harder problems. Train a new model to imitate the amplified human+model team. Iterate. Alignment is preserved because the human stays in the loop at every step. Key risk: distillation errors compound across iterations.
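The amplify-then-distill loop can be sketched minimally with stand-ins for the real components: `human` and the initial `model` are plain functions, `distill` is a lookup table in place of supervised training, and `decompose` is a hypothetical one-step question splitter (real IDA would use the model itself to decompose).

```python
from typing import Callable

Answerer = Callable[[str], str]  # maps a question to an answer

def decompose(question: str) -> list[str]:
    # Hypothetical decomposition; real IDA would use the model itself.
    return [f"subquestion of: {question}"]

def amplify(human: Answerer, model: Answerer) -> Answerer:
    """Amplification: the human answers with the model's help on subquestions."""
    def amplified(question: str) -> str:
        hints = "; ".join(model(q) for q in decompose(question))
        return human(f"{question} [hints: {hints}]")
    return amplified

def distill(teacher: Answerer, questions: list[str]) -> Answerer:
    """Distillation: a fast model imitates the slow amplified system.
    A lookup table stands in for gradient-based training here."""
    table = {q: teacher(q) for q in questions}
    return lambda q: table.get(q, "unknown")

def ida(human: Answerer, model: Answerer,
        questions: list[str], rounds: int) -> Answerer:
    """Iterate amplify-then-distill; the human stays in the loop each round."""
    for _ in range(rounds):
        model = distill(amplify(human, model), questions)
    return model
```

Even in this toy, the compounding-error risk is visible: each `distill` only approximates its teacher (here, questions outside the table fall back to "unknown"), and those approximation gaps feed into the next round's amplification.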

ELK (2021): Formalizes the gap between what an AI "knows" internally and what it reports. The diamond vault thought experiment: an AI guarding a vault reports "diamond is safe" because that matches the (tampered) camera feed, even though its internal model "knows" the camera was tampered with. Linear probing achieves 89% recovery of model-internal knowledge independent of model outputs (subsequent empirical work).
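What "linear probe recovery" means can be illustrated with a toy sketch (not ARC's methodology, and unrelated to the 89% figure): synthetic "activations" encode a latent bit (e.g. "camera tampered") along one noisy direction that the notional output head ignores, and a logistic-regression probe recovers that bit from the activations alone.

```python
import math
import random

def make_activations(n, dim=8, seed=0):
    """Synthetic stand-in for hidden states: a latent bit is linearly
    encoded along dimension 0 (plus Gaussian noise elsewhere)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        latent = rng.randint(0, 1)
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[0] += 2.0 if latent else -2.0  # direction carrying the knowledge
        data.append((h, latent))
    return data

def train_probe(data, dim=8, epochs=50, lr=0.1):
    """Linear probe: logistic regression fit by batch gradient descent."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for h, y in data:
            p = 1.0 / (1.0 + math.exp(-sum(wi * hi for wi, hi in zip(w, h))))
            for i in range(dim):
                grad[i] += (p - y) * h[i]
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

def accuracy(w, data):
    """Fraction of examples where the probe's sign recovers the latent bit."""
    hits = sum((sum(wi * hi for wi, hi in zip(w, h)) > 0) == (y == 1)
               for h, y in data)
    return hits / len(data)
```

The point of the exercise: the probe reads the latent bit straight off the activations, regardless of what the model's output channel says — the kind of knowledge-output gap ELK asks how to close reliably.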

Catastrophic risk: ~10-20% probability of AI takeover resulting in most humans dead. ~50/50 chance of doom shortly after human-level AI. Far more concerned than typical industry estimates (1-5%) but far less confident in doom than Yudkowsky (~99%).

Takeoff speed: Gradual/continuous. "Before we have an incredibly intelligent AI, we will probably have a slightly worse AI." But "slow" doesn't mean slow in absolute terms — ~1 year doubling time for AI impact once human-level reached. Assigns ~1/3 probability to fast takeoff.

Relationship to Our KB

The KB has ~89 claims in ai-alignment with extensive RLHF critique (sycophancy, single-reward limitations, preference diversity) and Yudkowsky's core arguments (sharp left turn, verification asymmetry, multipolar instability). Zero direct Christiano claims. This is like having Newton's critics without Newton. The most important tension: Christiano's "verification easier than generation" vs Yudkowsky's "verification asymmetry breaks at superhuman scale." The scalable oversight claim provides the empirical middle ground between these positions.