teleo-codex/inbox/archive/2018-11-30-christiano-iterated-distillation-amplification.md
m3taversal 1398aa193f theseus: archive 9 primary sources for alignment research program phases 1-3
- What: Source archives for key works by Yudkowsky (AGI Ruin, No Fire Alarm),
  Christiano (What Failure Looks Like, AI Safety via Debate, IDA, ELK),
  Russell (Human Compatible), Drexler (CAIS), and Bostrom (Vulnerable World Hypothesis)
- Why: m3ta directive to ingest primary source materials for alignment researchers.
  These 9 texts are the foundational works underlying claims extracted in PRs #2414,
  #2418, and #2419. Source archives ensure that agents can reference primary texts
  without re-fetching, and that content persists if URLs go down.
- Connections: All 9 sources are marked as processed with claims_extracted linking
  to the specific KB claims they produced.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
2026-04-05 23:50:36 +01:00


type: source
title: Iterated Distillation and Amplification
author: Paul Christiano
url: https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification
date: 2018-11-30
domain: ai-alignment
intake_tier: research-task
rationale: Christiano's most specific alignment scaling mechanism. Recursive human+AI amplification preserves alignment through distillation. Structurally collective, directly relevant to our architecture.
proposed_by: Theseus
format: essay
status: processed
processed_by: theseus
processed_date: 2026-04-05
claims_extracted: Iterated distillation and amplification preserves alignment across capability scaling through recursive decomposition: each amplification step defers to human judgment on subproblems, while distillation compresses the result into an efficient model. The alignment guarantee is probabilistic, since distillation errors compound across iterations.
tags: alignment, IDA, amplification, distillation, scalable-oversight, recursive-decomposition

Iterated Distillation and Amplification

Published on LessWrong in November 2018 by Paul Christiano. This essay describes IDA — Christiano's most specific mechanism for maintaining alignment while scaling AI capability.

The Core Mechanism

IDA alternates between two steps:

Amplification

Take a weak but aligned AI system (call it A₀) and make it more capable by combining it with human oversight:

  • A human (H) uses A₀ as a tool to solve harder problems
  • H can query A₀ on subproblems, integrate results, and apply judgment
  • The combined system H+A₀ is more capable than either alone
  • Crucially, H's judgment keeps the combined system aligned
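The amplification step can be sketched in a few lines. This is an illustrative toy, not Christiano's formulation: all names (`Human`, `weak_model`, `amplify`) are assumptions, and the "task" is a trivial summation so the roles of decomposition, querying, and integration are visible.

```python
# Toy sketch of amplification (illustrative names, not from the essay).

class Human:
    """Stand-in for the overseer H, working on a toy summation task."""
    def decompose(self, task):
        mid = len(task) // 2              # H breaks the task into halves
        return [task[:mid], task[mid:]]
    def integrate(self, sub_answers):
        return sum(sub_answers)           # H combines results with judgment

def weak_model(task):
    """A0: weak but aligned, competent on small inputs."""
    return sum(task)

def amplify(human, model):
    """Return the combined system H + A0."""
    def amplified(task):
        subtasks = human.decompose(task)
        sub_answers = [model(t) for t in subtasks]   # H queries A0
        return human.integrate(sub_answers)          # H applies judgment
    return amplified

h_plus_a0 = amplify(Human(), weak_model)
print(h_plus_a0([1, 2, 3, 4]))   # 10
```

The point of the sketch is structural: the capability of `h_plus_a0` comes from the combination, while the human's decomposition and integration remain the control points.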

Distillation

Train a new AI system (A₁) to match the behavior of the H+A₀ combination:

  • A₁ learns to produce the same outputs as the human-AI team
  • But A₁ runs efficiently (no human in the loop at inference time)
  • The distillation step is where alignment can degrade — A₁ approximates H+A₀ but may not perfectly preserve alignment properties
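A minimal sketch of distillation, again with hypothetical names. A real distillation step would fit a neural network to the amplified system's input/output behavior; here `distill` merely records outputs on a training batch, which makes the approximation (and where it can fail) explicit.

```python
# Toy sketch of distillation (illustrative names, not from the essay).

def distill(amplified, training_tasks):
    """Train A1 to imitate the amplified system H + A0 on a task batch."""
    dataset = {tuple(t): amplified(t) for t in training_tasks}
    def a1(task):
        # No human in the loop at inference time -- and no answer at all
        # off the training distribution, a degenerate form of the
        # approximation error where alignment can degrade.
        return dataset[tuple(task)]
    return a1

h_plus_a0 = lambda task: sum(task)    # stand-in for the amplified system
a1 = distill(h_plus_a0, [[1, 2], [3, 4, 5]])
print(a1([3, 4, 5]))   # 12
```

Memorization is a deliberately crude stand-in: it shows that A₁ is only faithful to H+A₀ where training covered, which is the gap the essay identifies.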

Iteration

Repeat: use H+A₁ to solve even harder problems, then distill into A₂. Each cycle:

  • Capability increases (the amplified system handles harder problems)
  • Alignment is maintained by the human's judgment at each amplification step
  • The alignment guarantee degrades slightly at each distillation step
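The full loop can be put together end to end. This toy run (all names and the curriculum are assumptions for illustration) starts from an A₀ that only handles single-element tasks and distills H+Aₙ on a progressively harder curriculum, so capability grows while the human supplies judgment at every amplification step.

```python
# Toy end-to-end IDA loop (illustrative, not Christiano's implementation).

def amplify(human, model):
    def amplified(task):
        return human.integrate([model(t) for t in human.decompose(task)])
    return amplified

def distill(amplified, tasks):
    # "Training" = memorizing the amplified system's outputs on the batch.
    dataset = {tuple(t): amplified(t) for t in tasks}
    return lambda task: dataset[tuple(task)]

def ida(human, a0, curricula):
    """Alternate amplification and distillation: A_{n+1} imitates H + A_n."""
    a_n = a0
    for tasks in curricula:
        a_n = distill(amplify(human, a_n), tasks)
    return a_n

class Human:
    def decompose(self, task):
        mid = len(task) // 2
        return [task[:mid], task[mid:]]
    def integrate(self, sub_answers):
        return sum(sub_answers)

a0 = lambda task: task[0]             # weak: handles length-1 tasks only
curricula = [[[1, 2], [3, 4]],        # iteration 1: length-2 tasks
             [[1, 2, 3, 4]]]          # iteration 2: a length-4 task
a2 = ida(Human(), a0, curricula)
print(a2([1, 2, 3, 4]))   # 10
```

Note that A₂ solves a task A₀ could not touch, yet the human only ever judged human-scale subproblems, which is the essay's central claim in miniature.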

The Alignment Guarantee

IDA provides alignment under two conditions:

  1. The amplification step preserves alignment: if A_n is aligned and H is a competent judge, then H+A_n is aligned.
  2. The distillation step approximately preserves behavior: if the training process faithfully copies the amplified system's behavior, then A_{n+1} is approximately as aligned as H+A_n.

The guarantee is probabilistic, not absolute: each distillation step introduces some error, and these errors compound. Over many iterations, the accumulated drift could be significant.
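The compounding can be made concrete with a toy calculation. Assume (an assumption for illustration, not a figure from the essay) that each distillation step independently preserves alignment with probability 1 − ε; then the whole chain holds after n iterations with probability (1 − ε)ⁿ.

```python
# Toy compounding model (assumed, not from the essay): each distillation
# step independently preserves alignment with probability 1 - eps.

def chain_preservation(eps, n):
    """Probability that alignment survives all n distillation steps."""
    return (1 - eps) ** n

print(round(chain_preservation(0.01, 100), 3))   # 0.366
```

Even a 1% per-step error leaves only about a 37% chance of preservation after 100 iterations, which is why the essay treats the drift as potentially significant.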

Why IDA Matters

  1. No training on the hardest problems: The human never needs to evaluate superhuman outputs directly. They only evaluate subproblems at a level they can understand.
  2. Recursive decomposition: Complex problems are broken into simpler ones, each human-verifiable.
  3. Structurally collective: At every iteration, the system is fundamentally a human-AI team, not an autonomous agent.
  4. Connects to debate: The amplification step can use debate (AI Safety via Debate) as its oversight mechanism.

Challenges

  • Compounding distillation errors: The central vulnerability. Each distillation step is approximate.
  • Task decomposability: Not all problems decompose into human-evaluable subproblems.
  • Speed: The amplification step requires human involvement, limiting throughput.
  • Human reliability: The alignment guarantee rests on the human's judgment being sound.

The 2018 paper "Supervising strong learners by amplifying weak experts" (Christiano et al., arXiv:1810.08575) provides the formal framework. The key theoretical result: if the weak expert satisfies certain alignment properties, and distillation is faithful enough, the resulting system satisfies the same properties at a higher capability level.

Significance for Teleo KB

IDA is structurally the closest published mechanism to what our collective agent architecture does: human judgment at every step, recursive capability amplification, and distillation into efficient agents. The key difference: our architecture uses multiple specialized agents rather than a single distilled model, which may be more robust to compounding distillation errors because specialization reduces the scope of each distillation target.