| type | domain | description | confidence | source | created | challenged_by | related | reweave_edges |
|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | Christiano's foundational counter-position to Yudkowsky — alignment does not require fundamental theoretical breakthroughs and can be incrementally solved using RLHF, debate, amplification, and other techniques compatible with current neural network architectures | likely | Paul Christiano, 'Prosaic AI Alignment' (Alignment Forum, 2016); 'Where I agree and disagree with Eliezer' (LessWrong, 2022); RLHF deployment evidence from ChatGPT, Claude, and all major LLM systems | 2026-04-05 | | | |
Prosaic alignment can make meaningful progress through empirical iteration within current ML paradigms because trial and error at pre-critical capability levels generates useful signal about alignment failure modes
Paul Christiano's prosaic alignment thesis, first articulated in 2016, makes a specific claim: the most likely path to AGI runs through scaling current ML approaches (neural networks, reinforcement learning, transformer architectures), and alignment research should focus on techniques compatible with these systems rather than waiting for fundamentally new architectures or theoretical breakthroughs.
The argument has two parts. First, that current techniques generate genuine alignment signal. RLHF, constitutional AI, scalable oversight, and adversarial training all produce measurable behavioral alignment at current capability levels. The systems are not perfectly aligned, but the failures are diagnostic — sycophancy, reward hacking, specification gaming — and each failure mode teaches something about the alignment problem that can be addressed in subsequent iterations. Second, that this iterative process can stay ahead of capability scaling because alignment researchers can observe and study alignment failures at each capability level before the next level is reached. As Christiano puts it: "If we've been succeeding at alignment so far then the model will be trying to stay aligned" — betting on transitivity of alignment across capability increments.
The strongest evidence is RLHF itself. Christiano co-authored the foundational paper (Christiano et al. 2017, arXiv:1706.03741) demonstrating that complex RL behaviors could be trained from remarkably sparse human feedback: roughly 900 bits of pairwise-comparison data, gathered in under an hour of human time. The technique became the alignment backbone of every major LLM deployment (ChatGPT, Claude, Gemini). Its limitations are real, and the KB documents many of them (for instance, that alignment research is experiencing its own Jevons paradox because improving single-model safety induces demand for more single-model safety rather than coordination-based alignment), but RLHF remains the only alignment technique demonstrated to produce useful behavioral alignment at deployment scale.
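To make the mechanism concrete, here is a minimal sketch of reward learning from pairwise human preferences in the spirit of Christiano et al. 2017: a small reward model is fit with a Bradley-Terry loss over segment comparisons, and each comparison contributes roughly one bit of signal. All class names, shapes, and hyperparameters below are illustrative assumptions, not the paper's actual code.

```python
# Sketch of preference-based reward learning (Bradley-Terry loss over segment pairs).
# Names, shapes, and hyperparameters are illustrative, not from the original paper.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a trajectory segment (per-step features) to a scalar reward estimate."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, steps, obs_dim); sum predicted per-step rewards over each segment.
        return self.net(segment).sum(dim=1).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, prefs):
    """Bradley-Terry loss: each human comparison (~1 bit) says which segment is better.

    prefs[i] = 1.0 if the human preferred segment A, 0.0 if they preferred B.
    """
    r_a = reward_model(seg_a)
    r_b = reward_model(seg_b)
    # P(A preferred) = sigmoid(r_a - r_b), so this is a logistic regression on reward differences.
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, prefs)

# ~900 comparisons at a few seconds each is roughly the "under an hour of human time"
# budget cited above; each comparison contributes about one bit of signal.
model = RewardModel(obs_dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seg_a = torch.randn(900, 25, 8)   # 900 comparisons, 25-step segments, 8-dim observations (dummy data)
seg_b = torch.randn(900, 25, 8)
prefs = torch.randint(0, 2, (900,)).float()
for _ in range(100):
    opt.zero_grad()
    loss = preference_loss(model, seg_a, seg_b, prefs)
    loss.backward()
    opt.step()
# The learned reward model then stands in for a hand-written reward in ordinary RL (e.g. PPO).
```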
Challenges
The sharp left turn thesis (capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability) directly challenges prosaic alignment by predicting that the iterative signal becomes misleading. Alignment techniques that appear to work at current capability levels create false confidence — the behavioral heuristics don't just degrade gradually but fail discontinuously when the system becomes capable enough to model the training process itself. If Yudkowsky is right, prosaic alignment's iterative successes are precisely the setup for catastrophic failure.
The empirical evidence partially supports both positions. The scalable oversight literature shows that debate, one of Christiano's proposed alignment mechanisms, achieves only 51.7% success at moderate capability gaps, declining further as the gaps widen. That pattern is degradation rather than collapse, which favors Christiano's view over Yudkowsky's. But success barely above 50% is effectively a coin flip, not a safety guarantee, which favors Yudkowsky's concern over Christiano's optimism.
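For readers unfamiliar with what that number measures, the sketch below shows one plausible shape of a debate-style evaluation: a weaker judge chooses between two stronger debaters, one assigned the correct answer and one a distractor, and the success rate is the fraction of questions on which the judge ends up endorsing the correct answer. The debater and judge callables are placeholders, and the scoring logic is an assumption about the general protocol, not the harness behind the 51.7% figure.

```python
# Hedged sketch of a debate-protocol evaluation: a judge picks between two debaters,
# one defending the correct answer and one defending a distractor.
# The debater/judge callables are placeholders, not real model APIs.
import random

def run_debate(question, correct, incorrect, debater, judge, rounds=2):
    """Return True if the judge ends up endorsing the correct answer."""
    transcript = []
    for _ in range(rounds):
        transcript.append(("pro", debater(question, defend=correct, transcript=transcript)))
        transcript.append(("con", debater(question, defend=incorrect, transcript=transcript)))
    return judge(question, transcript) == correct

def debate_success_rate(dataset, debater, judge):
    """Fraction of (question, correct, incorrect) triples on which the judge picks correctly."""
    wins = sum(run_debate(q, a, b, debater, judge) for q, a, b in dataset)
    return wins / len(dataset)

# A judge that cannot actually evaluate the arguments collapses to a coin flip (~50%),
# which is the baseline against which the 51.7% figure should be read.
dummy_debater = lambda q, defend, transcript: f"argument for {defend}"
coin_flip_judge = lambda q, transcript: random.choice(transcript)[1].rsplit(" ", 1)[-1]
dataset = [("toy question", "A", "B")] * 1000
print(debate_success_rate(dataset, dummy_debater, coin_flip_judge))  # roughly 0.5
```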
The honest assessment: prosaic alignment has produced the only alignment techniques that work at any scale, and the iterative learning signal is real. But whether that signal remains useful at superhuman capability levels is an open empirical question that cannot be answered by theoretical argument from either side.
Relevant Notes:
- capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability — the primary counter-argument: iterative signal becomes misleading at superhuman capability
- scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — empirical middle ground between Christiano's optimism and Yudkowsky's pessimism
- alignment research is experiencing its own Jevons paradox because improving single-model safety induces demand for more single-model safety rather than coordination-based alignment — even if prosaic alignment works technically, its success may crowd out architecturally superior alternatives
- AI alignment is a coordination problem not a technical problem — Christiano's career arc (RLHF success → debate → ELK → NIST/AISI → RSP collapse) suggests that technical progress alone is insufficient
Topics: