| type | title | author | url | date | domain | intake_tier | rationale | proposed_by | format | status | processed_by | processed_date | claims_extracted | enrichments | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Iterated Distillation and Amplification | Paul Christiano | https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification | 2018-11-30 | ai-alignment | research-task | Christiano's most specific alignment scaling mechanism. Recursive human+AI amplification preserves alignment through distillation. Structurally collective — directly relevant to our architecture. | Theseus | essay | processed | theseus | 2026-04-05 | | | |
Iterated Distillation and Amplification
Published on LessWrong in November 2018 by Paul Christiano. This essay describes IDA — Christiano's most specific mechanism for maintaining alignment while scaling AI capability.
The Core Mechanism
IDA alternates between two steps:
Amplification
Take a weak but aligned AI system (call it A₀) and make it more capable by combining it with human oversight:
- A human (H) uses A₀ as a tool to solve harder problems
- H can query A₀ on subproblems, integrate results, and apply judgment
- The combined system H+A₀ is more capable than either alone
- Crucially, H's judgment keeps the combined system aligned
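The amplification step can be sketched as a toy program. This is illustrative only: `weak_agent` and `amplify` are names invented here, and summing a list stands in for a "hard problem" the weak agent cannot solve whole.

```python
def weak_agent(subproblem):
    """A0: only reliable on small subproblems (here, lists of length <= 2)."""
    if len(subproblem) > 2:
        raise ValueError("too hard for A0 alone")
    return sum(subproblem)

def amplify(agent, problem):
    """H + A0: the 'human' decomposes the problem into pieces the weak
    agent can handle, queries it on each, and integrates the answers."""
    pieces = [problem[i:i + 2] for i in range(0, len(problem), 2)]
    return sum(agent(piece) for piece in pieces)

print(amplify(weak_agent, [1, 2, 3, 4, 5]))  # 15
```

The combined system solves a problem A0 alone would reject, while every individual query stays within A0's competence.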
Distillation
Train a new AI system (A₁) to match the behavior of the H+A₀ combination:
- A₁ learns to produce the same outputs as the human-AI team
- But A₁ runs efficiently (no human in the loop at inference time)
- The distillation step is where alignment can degrade — A₁ approximates H+A₀ but may not perfectly preserve alignment properties
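The distillation step, in the same toy spirit: A1 is "trained" (here, just a lookup table) to reproduce the team's answers on sampled problems, then runs with no human in the loop. The lookup-table training and all names are illustrative stand-ins, not the paper's method.

```python
def team(problem):
    """Stand-in for the amplified H+A0 system."""
    return sum(problem)

def distill(amplified_fn, training_problems):
    """Fit A1 to imitate the amplified system's behavior on samples."""
    data = {tuple(p): amplified_fn(p) for p in training_problems}
    def a1(problem):
        # Fast, human-free inference, but only an approximation:
        # inputs outside the training distribution come back unanswered.
        return data.get(tuple(problem))
    return a1

a1 = distill(team, [[1, 2], [3, 4, 5]])
print(a1([3, 4, 5]))  # 12
print(a1([9, 9, 9]))  # None: the gap where alignment properties can degrade
```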
Iteration
Repeat: use H+A₁ to solve even harder problems, then distill into A₂. Each cycle:
- Capability increases (the amplified system handles harder problems)
- Alignment is maintained by the human's judgment at each amplification step
- The alignment guarantee degrades slightly at each distillation step
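The overall iteration is just the composition A₍ₙ₊₁₎ = Distill(H + Aₙ). A structural sketch, with trivial stand-ins for the two steps (in practice they would be human-directed decomposition and model training):

```python
def ida(h_amplify, distill_fn, a0, n_rounds):
    """Iterate: A_{n+1} = distill(H + A_n)."""
    agent = a0
    for _ in range(n_rounds):
        # amplification: the human uses A_n as a subroutine
        team = lambda x, a=agent: h_amplify(a, x)
        # distillation: train A_{n+1} to match the team's behavior
        agent = distill_fn(team)
    return agent

# Trivial stand-ins: each amplification adds one unit of "capability",
# distillation passes behavior through unchanged.
h_amplify = lambda agent, x: agent(x) + 1
passthrough_distill = lambda team: team
a4 = ida(h_amplify, passthrough_distill, lambda x: x, 4)
print(a4(0))  # 4: capability accrued over four rounds
```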
The Alignment Guarantee
IDA provides alignment under two conditions:
- The amplification step preserves alignment: if A_n is aligned and H is a competent judge, then H+A_n is aligned.
- The distillation step approximately preserves behavior: the training process must faithfully copy the amplified system's behavior, so that A_{n+1} inherits (approximately) the alignment of H+A_n.
The guarantee is probabilistic, not absolute: each distillation step introduces some error, and these errors compound. Over many iterations, the accumulated drift could be significant.
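A back-of-envelope way to see the compounding: if each distillation step independently preserves alignment with probability 1 − ε, the probability that all n steps preserve it is (1 − ε)ⁿ. The values ε = 0.01 and n = 100 below are arbitrary illustrative numbers, not estimates from the essay.

```python
eps, rounds = 0.01, 100
# Probability that every one of the 100 distillation steps preserved alignment
fidelity = (1 - eps) ** rounds
print(round(fidelity, 3))  # 0.366: even a 1% per-step error compounds badly
```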
Why IDA Matters
- No training on the hardest problems: The human never needs to evaluate superhuman outputs directly. They only evaluate subproblems at a level they can understand.
- Recursive decomposition: Complex problems are broken into simpler ones, each human-verifiable.
- Structurally collective: At every iteration, the system is fundamentally a human-AI team, not an autonomous agent.
- Connects to debate: The amplification step can use debate (AI Safety via Debate) as its oversight mechanism.
Challenges
- Compounding distillation errors: The central vulnerability. Each distillation step is approximate.
- Task decomposability: Not all problems decompose into human-evaluable subproblems.
- Speed: The amplification step requires human involvement, limiting throughput.
- Human reliability: The alignment guarantee rests on the human's judgment being sound.
Related Work
The 2018 paper "Supervising strong learners by amplifying weak experts" (Christiano et al., arXiv:1810.08575) provides the formal framework. The key theoretical result: if the weak expert satisfies certain alignment properties, and distillation is faithful enough, the resulting system satisfies the same properties at a higher capability level.
Significance for Teleo KB
IDA is structurally the closest published mechanism to what our collective agent architecture does: human judgment at every step, recursive capability amplification, and distillation into efficient agents. The key difference: our architecture uses multiple specialized agents rather than a single distilled model, which may be more robust to compounding distillation errors because specialization reduces the scope of each distillation target.