---
type: source
title: "Iterated Distillation and Amplification"
author: "Paul Christiano"
url: https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification
date: 2018-11-30
domain: ai-alignment
intake_tier: research-task
rationale: "Christiano's most specific alignment scaling mechanism. Recursive human+AI amplification preserves alignment through distillation. Structurally collective — directly relevant to our architecture."
proposed_by: Theseus
format: essay
status: processed
processed_by: theseus
processed_date: 2026-04-05
claims_extracted:
- "iterated distillation and amplification preserves alignment across capability scaling through recursive decomposition because each amplification step defers to human judgment on subproblems while distillation compresses the result into an efficient model but the alignment guarantee is probabilistic since distillation errors compound across iterations"
enrichments: []
tags: [alignment, IDA, amplification, distillation, scalable-oversight, recursive-decomposition]
---
# Iterated Distillation and Amplification
Published on LessWrong in November 2018 by Paul Christiano. This essay describes IDA — Christiano's most specific mechanism for maintaining alignment while scaling AI capability.
## The Core Mechanism
IDA alternates between two steps:
### Amplification
Take a weak but aligned AI system (call it A₀) and make it more capable by combining it with human oversight:
- A human (H) uses A₀ as a tool to solve harder problems
- H can query A₀ on subproblems, integrate results, and apply judgment
- The combined system H+A₀ is more capable than either alone
- Crucially, H's judgment keeps the combined system aligned
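The amplification step above can be sketched in a few lines. This is a minimal illustration, not Christiano's implementation: `weak_model`, `human_decompose`, and `human_integrate` are hypothetical stand-ins for A₀ and the two roles H plays.

```python
# One amplification step: a human (stand-in functions below) decomposes a
# hard question, queries the weak model A0 on subproblems, and integrates
# the answers with their own judgment. All names are hypothetical.

def weak_model(question: str) -> str:
    """Stand-in for A0: answers easy subproblems."""
    return f"answer({question})"

def human_decompose(question: str) -> list[str]:
    """Stand-in for H breaking a hard problem into subproblems."""
    return [f"{question}/part{i}" for i in range(2)]

def human_integrate(question: str, sub_answers: list[str]) -> str:
    """Stand-in for H combining sub-answers with their own judgment."""
    return f"integrate({question}: {', '.join(sub_answers)})"

def amplify(question: str) -> str:
    """The H+A0 team: more capable than A0 alone, aligned by H's judgment."""
    subs = human_decompose(question)
    sub_answers = [weak_model(s) for s in subs]
    return human_integrate(question, sub_answers)
```

Note that A₀ is only ever queried on subproblems H chose, which is how H's judgment stays in the loop.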
### Distillation
Train a new AI system (A₁) to match the behavior of the H+A₀ combination:
- A₁ learns to produce the same outputs as the human-AI team
- But A₁ runs efficiently (no human in the loop at inference time)
- The distillation step is where alignment can degrade — A₁ approximates H+A₀ but may not perfectly preserve alignment properties
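A sketch of the distillation step, under loose assumptions: the "training" here is a trivial lookup table standing in for supervised learning, and `amplified_system` is a hypothetical stand-in for the slow H+A₀ team.

```python
# Distillation sketch: collect (question, answer) pairs from the amplified
# system and fit a fast model to imitate them. A real system would train a
# neural network; a lookup table makes the approximation failure obvious.

def amplified_system(question: str) -> str:
    """Stand-in for the slow, human-in-the-loop H+A0 team."""
    return question.upper()

def distill(questions: list[str]) -> dict[str, str]:
    """Build an imitation dataset and 'train' a lookup-table model A1."""
    return {q: amplified_system(q) for q in questions}

a1 = distill(["plan trip", "review code"])

# A1 now answers without a human in the loop, but only approximately:
# anything outside the training distribution is mishandled.
fast_answer = a1.get("plan trip", "out of distribution")
```

The lookup table exaggerates the point, but the structure of the failure is the same: A₁ matches H+A₀ only where the training signal reached.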
### Iteration
Repeat: use H+A₁ to solve even harder problems, then distill into A₂. Each cycle:
- Capability increases (the amplified system handles harder problems)
- Alignment is maintained by the human's judgment at each amplification step
- The alignment guarantee degrades slightly at each distillation step
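Putting the cycle together, the full IDA loop is an alternation of the two steps. A compressed sketch, again with hypothetical stand-ins (`amplify_with`, `distill_from`) rather than the essay's actual procedure:

```python
# The IDA loop: alternate amplification (human + current model) with
# distillation (train the next model to imitate the team). Here
# distillation is exact, so each cycle only adds a layer of oversight.

def amplify_with(model, question: str) -> str:
    """H uses `model` on subproblems; here just wraps its answer in H's judgment."""
    return "H[" + model(question) + "]"

def distill_from(team):
    """Return a new standalone model that imitates the amplified team."""
    return lambda q: team(q)

model = lambda q: q  # A0: the initial weak, aligned system
for _ in range(3):   # three amplify->distill cycles: A0 -> A1 -> A2 -> A3
    team = lambda q, m=model: amplify_with(m, q)  # bind the current model
    model = distill_from(team)

# Each of A3's answers now reflects three layers of human oversight.
```

In this idealized version distillation is lossless, so alignment is preserved exactly; the essay's point is that real distillation is approximate, which is where the guarantee weakens.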
## The Alignment Guarantee
IDA provides alignment under two conditions:
1. **The amplification step preserves alignment**: If A_n is aligned and H is a competent judge, then H+A_n is aligned
2. **The distillation step approximately preserves behavior**: If the training process faithfully copies the amplified system's behavior, then A_{n+1} approximately inherits the alignment of H+A_n
The guarantee is **probabilistic, not absolute**: each distillation step introduces some error, and these errors compound. Over many iterations, the accumulated drift could be significant.
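The compounding can be made concrete with a back-of-envelope calculation. The numbers below are illustrative assumptions, not figures from the essay: if each distillation step independently preserves the alignment properties with probability p, the chance that no step has drifted after n iterations is p**n.

```python
# Illustrative compounding of distillation error: per-step fidelity p
# (assumed, not from the essay) decays geometrically over n iterations.

p = 0.99  # probability a single distillation step preserves alignment
for n in (1, 10, 100):
    print(n, round(p ** n, 3))
# Even at 99% per-step fidelity, 100 iterations leave only ~37% odds
# that no step has drifted: small errors become large accumulated drift.
```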
## Why IDA Matters
1. **No training on the hardest problems**: The human never needs to evaluate superhuman outputs directly. They only evaluate subproblems at a level they can understand.
2. **Recursive decomposition**: Complex problems are broken into simpler ones, each human-verifiable.
3. **Structurally collective**: At every iteration, the system is fundamentally a human-AI team, not an autonomous agent.
4. **Connects to debate**: The amplification step can use debate (AI Safety via Debate) as its oversight mechanism.
## Challenges
- **Compounding distillation errors**: The central vulnerability. Each distillation step is approximate.
- **Task decomposability**: Not all problems decompose into human-evaluable subproblems.
- **Speed**: The amplification step requires human involvement, limiting throughput.
- **Human reliability**: The alignment guarantee rests on the human's judgment being sound.
## Related Work
The 2018 paper "Supervising strong learners by amplifying weak experts" (Christiano et al., arXiv:1810.08575) provides the formal framework. The key theoretical result: if the weak expert satisfies certain alignment properties, and distillation is faithful enough, the resulting system satisfies the same properties at a higher capability level.
## Significance for Teleo KB
IDA is structurally the closest published mechanism to what our collective agent architecture does: human judgment at every step, recursive capability amplification, and distillation into efficient agents. The key difference: our architecture uses multiple specialized agents rather than a single distilled model, which may be more robust to compounding distillation errors because specialization reduces the scope of each distillation target.