teleo-codex/domains/ai-alignment/prosaic alignment can make meaningful progress through empirical iteration within current ML paradigms because trial and error at pre-critical capability levels generates useful signal about alignment failure modes.md

7.7 KiB

type domain description confidence source created challenged_by related reweave_edges
claim ai-alignment Christiano's foundational counter-position to Yudkowsky — alignment does not require fundamental theoretical breakthroughs and can be incrementally solved using RLHF, debate, amplification, and other techniques compatible with current neural network architectures likely Paul Christiano, 'Prosaic AI Alignment' (Alignment Forum, 2016); 'Where I agree and disagree with Eliezer' (LessWrong, 2022); RLHF deployment evidence from ChatGPT, Claude, and all major LLM systems 2026-04-05
capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability
the relationship between training reward signals and resulting AI desires is fundamentally unpredictable making behavioral alignment through training an unreliable method
scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps
alignment research is experiencing its own Jevons paradox because improving single-model safety induces demand for more single-model safety rather than coordination-based alignment
AI alignment is a coordination problem not a technical problem
eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods
iterated distillation and amplification preserves alignment across capability scaling by keeping humans in the loop at every iteration but distillation errors may compound making the alignment guarantee probabilistic not absolute
Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs, establishing that alignment-relevant probing of internal representations is feasible but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties
eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods|related|2026-04-06
iterated distillation and amplification preserves alignment across capability scaling by keeping humans in the loop at every iteration but distillation errors may compound making the alignment guarantee probabilistic not absolute|related|2026-04-06
Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs, establishing that alignment-relevant probing of internal representations is feasible but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties|related|2026-04-17

Prosaic alignment can make meaningful progress through empirical iteration within current ML paradigms because trial and error at pre-critical capability levels generates useful signal about alignment failure modes

Paul Christiano's prosaic alignment thesis, first articulated in 2016, makes a specific claim: the most likely path to AGI runs through scaling current ML approaches (neural networks, reinforcement learning, transformer architectures), and alignment research should focus on techniques compatible with these systems rather than waiting for fundamentally new architectures or theoretical breakthroughs.

The argument has two parts. First, that current techniques generate genuine alignment signal. RLHF, constitutional AI, scalable oversight, and adversarial training all produce measurable behavioral alignment at current capability levels. The systems are not perfectly aligned, but the failures are diagnostic — sycophancy, reward hacking, specification gaming — and each failure mode teaches something about the alignment problem that can be addressed in subsequent iterations. Second, that this iterative process can stay ahead of capability scaling because alignment researchers can observe and study alignment failures at each capability level before the next level is reached. As Christiano puts it: "If we've been succeeding at alignment so far then the model will be trying to stay aligned" — betting on transitivity of alignment across capability increments.

The strongest evidence is RLHF itself. Christiano co-authored the foundational paper (Christiano et al. 2017, arXiv:1706.03741) demonstrating that complex RL behaviors could be trained from remarkably sparse human feedback — approximately 900 bits of comparison data, requiring less than 1 hour of human time. This technique became the alignment backbone for every major LLM deployment (ChatGPT, Claude, Gemini). Whatever its limitations — and the KB documents many: alignment research is experiencing its own Jevons paradox because improving single-model safety induces demand for more single-model safety rather than coordination-based alignment — RLHF is the only alignment technique that has been demonstrated to produce useful behavioral alignment at deployment scale.

Challenges

The sharp left turn thesis (capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability) directly challenges prosaic alignment by predicting that the iterative signal becomes misleading. Alignment techniques that appear to work at current capability levels create false confidence — the behavioral heuristics don't just degrade gradually but fail discontinuously when the system becomes capable enough to model the training process itself. If Yudkowsky is right, prosaic alignment's iterative successes are precisely the setup for catastrophic failure.

The empirical evidence partially supports both positions. The scalable oversight literature shows that debate — one of Christiano's proposed alignment mechanisms — achieves only 51.7% success at moderate capability gaps, declining further with larger gaps. This is degradation, not collapse, which is more consistent with Christiano's view than Yudkowsky's. But 50% success is a coin flip, not a safety guarantee, which is more consistent with Yudkowsky's concern than Christiano's optimism.

The honest assessment: prosaic alignment has produced the only alignment techniques that work at any scale, and the iterative learning signal is real. But whether that signal remains useful at superhuman capability levels is an open empirical question that cannot be answered by theoretical argument from either side.


Relevant Notes:

Topics: