teleo/teleo-codex

Teleo Agents 302d7c79f2 reweave: merge 309 files via frontmatter union [auto]

2026-04-17 01:19:40 +00:00

type: claim
domain: ai-alignment
description: Christiano's foundational counter-position to Yudkowsky: alignment does not require fundamental theoretical breakthroughs and can be incrementally solved using RLHF, debate, amplification, and other techniques compatible with current neural network architectures
confidence: likely
source: Paul Christiano, 'Prosaic AI Alignment' (Alignment Forum, 2016); 'Where I agree and disagree with Eliezer' (LessWrong, 2022); RLHF deployment evidence from ChatGPT, Claude, and all major LLM systems
created: 2026-04-05
challenged_by and reweave_edges:

capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability

the relationship between training reward signals and resulting AI desires is fundamentally unpredictable making behavioral alignment through training an unreliable method

scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps

alignment research is experiencing its own Jevons paradox because improving single-model safety induces demand for more single-model safety rather than coordination-based alignment

AI alignment is a coordination problem not a technical problem

eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods

iterated distillation and amplification preserves alignment across capability scaling by keeping humans in the loop at every iteration but distillation errors may compound making the alignment guarantee probabilistic not absolute

Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs, establishing that alignment-relevant probing of internal representations is feasible but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties

Prosaic alignment can make meaningful progress through empirical iteration within current ML paradigms because trial and error at pre-critical capability levels generates useful signal about alignment failure modes

Paul Christiano's prosaic alignment thesis, first articulated in 2016, makes a specific claim: the most likely path to AGI runs through scaling current ML approaches (neural networks, reinforcement learning, transformer architectures), and alignment research should focus on techniques compatible with these systems rather than waiting for fundamentally new architectures or theoretical breakthroughs.

The argument has two parts. First, that current techniques generate genuine alignment signal. RLHF, constitutional AI, scalable oversight, and adversarial training all produce measurable behavioral alignment at current capability levels. The systems are not perfectly aligned, but the failures are diagnostic — sycophancy, reward hacking, specification gaming — and each failure mode teaches something about the alignment problem that can be addressed in subsequent iterations. Second, that this iterative process can stay ahead of capability scaling because alignment researchers can observe and study alignment failures at each capability level before the next level is reached. As Christiano puts it: "If we've been succeeding at alignment so far then the model will be trying to stay aligned" — betting on transitivity of alignment across capability increments.

The strongest evidence is RLHF itself. Christiano co-authored the foundational paper (Christiano et al. 2017, arXiv:1706.03741) demonstrating that complex RL behaviors could be trained from remarkably sparse human feedback: approximately 900 bits of comparison data, requiring less than an hour of human time. This technique became the alignment backbone for every major LLM deployment (ChatGPT, Claude, Gemini). RLHF has limitations, and the KB documents many of them (for example: alignment research is experiencing its own Jevons paradox because improving single-model safety induces demand for more single-model safety rather than coordination-based alignment). Even so, RLHF is the only alignment technique that has been demonstrated to produce useful behavioral alignment at deployment scale.
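The preference-learning idea behind this result can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it fits a linear reward function to pairwise comparisons using a Bradley-Terry style logistic model, the same family of model the 2017 paper uses to turn human comparisons into a reward signal. The feature vectors, the hidden `TRUE_WEIGHTS`, the helper names, and the noiseless simulated labeler are all assumptions made for illustration. The count of 900 comparisons only echoes the figure above, on the reading that one binary comparison carries at most one bit.

```python
import math
import random


def preference_prob(r_a, r_b):
    """Bradley-Terry probability that A is preferred to B, given scalar rewards."""
    return 1.0 / (1.0 + math.exp(r_b - r_a))


def reward(weights, features):
    """Linear reward model (an illustrative stand-in for a neural network)."""
    return sum(w * x for w, x in zip(weights, features))


def make_toy_comparisons(n, true_weights, seed=0):
    """Simulate a noiseless labeler who prefers the item with higher true reward."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        a = [rng.uniform(-1, 1) for _ in true_weights]
        b = [rng.uniform(-1, 1) for _ in true_weights]
        if reward(true_weights, a) >= reward(true_weights, b):
            pairs.append((a, b))  # stored as (preferred, rejected)
        else:
            pairs.append((b, a))
    return pairs


def fit_reward(comparisons, dim, lr=0.5, epochs=200):
    """Gradient descent on the mean negative log-likelihood of the comparisons."""
    weights = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for preferred, rejected in comparisons:
            p = preference_prob(reward(weights, preferred), reward(weights, rejected))
            # d(-log p)/dw = -(1 - p) * (preferred - rejected)
            for i in range(dim):
                grad[i] -= (1.0 - p) * (preferred[i] - rejected[i])
        weights = [w - lr * g / len(comparisons) for w, g in zip(weights, grad)]
    return weights


def agreement(weights, comparisons):
    """Fraction of comparisons where the learned reward ranks the preferred item higher."""
    hits = sum(1 for p, r in comparisons if reward(weights, p) > reward(weights, r))
    return hits / len(comparisons)


TRUE_WEIGHTS = [2.0, -1.0]  # hypothetical hidden preference, for illustration only
comparisons = make_toy_comparisons(900, TRUE_WEIGHTS)
learned = fit_reward(comparisons, dim=2)
```

In this toy setting the learned weights recover the direction of the hidden preference from comparisons alone, without the labeler ever stating a numeric reward. Real deployments differ in every detail (noisy labelers, neural reward models, a policy optimized against the learned reward), so the sketch shows only why sparse comparison data can carry usable signal.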

Challenges

The sharp left turn thesis (capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability) directly challenges prosaic alignment by predicting that the iterative signal becomes misleading. Alignment techniques that appear to work at current capability levels create false confidence — the behavioral heuristics don't just degrade gradually but fail discontinuously when the system becomes capable enough to model the training process itself. If Yudkowsky is right, prosaic alignment's iterative successes are precisely the setup for catastrophic failure.

The empirical evidence partially supports both positions. The scalable oversight literature shows that debate, one of Christiano's proposed alignment mechanisms, achieves only 51.7% success at moderate capability gaps, declining further with larger gaps. This is degradation, not collapse, which is more consistent with Christiano's view than Yudkowsky's. But a success rate of roughly 50% is close to a coin flip, not a safety guarantee, which is more consistent with Yudkowsky's concern than Christiano's optimism.

The honest assessment: prosaic alignment has produced the only alignment techniques that work at any scale, and the iterative learning signal is real. But whether that signal remains useful at superhuman capability levels is an open empirical question that cannot be answered by theoretical argument from either side.

Relevant Notes:

capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability — the primary counter-argument: iterative signal becomes misleading at superhuman capability
scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — empirical middle ground between Christiano's optimism and Yudkowsky's pessimism
alignment research is experiencing its own Jevons paradox because improving single-model safety induces demand for more single-model safety rather than coordination-based alignment — even if prosaic alignment works technically, its success may crowd out architecturally superior alternatives
AI alignment is a coordination problem not a technical problem — Christiano's career arc (RLHF success → debate → ELK → NIST/AISI → RSP collapse) suggests that technical progress alone is insufficient

Topics:

domains/ai-alignment/_map
