- What: Source archives for key works by Yudkowsky (AGI Ruin, No Fire Alarm), Christiano (What Failure Looks Like, AI Safety via Debate, IDA, ELK), Russell (Human Compatible), Drexler (CAIS), and Bostrom (Vulnerable World Hypothesis)
- Why: m3ta directive to ingest primary source materials for alignment researchers. These 9 texts are the foundational works underlying claims extracted in PRs #2414, #2418, and #2419. Source archives ensure agents can reference primary texts without re-fetching, and that content persists if URLs go down.
- Connections: All 9 sources are marked as processed, with claims_extracted linking to the specific KB claims they produced.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
| type | title | author | url | date | domain | intake_tier | rationale | proposed_by | format | status | processed_by | processed_date | claims_extracted | enrichments | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Iterated Distillation and Amplification | Paul Christiano | https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification | 2018-11-30 | ai-alignment | research-task | Christiano's most specific alignment scaling mechanism. Recursive human+AI amplification preserves alignment through distillation. Structurally collective — directly relevant to our architecture. | Theseus | essay | processed | theseus | 2026-04-05 | | | |
Iterated Distillation and Amplification
Published on LessWrong in November 2018 by Paul Christiano. This essay describes IDA — Christiano's most specific mechanism for maintaining alignment while scaling AI capability.
The Core Mechanism
IDA alternates between two steps:
Amplification
Take a weak but aligned AI system (call it A₀) and make it more capable by combining it with human oversight:
- A human (H) uses A₀ as a tool to solve harder problems
- H can query A₀ on subproblems, integrate results, and apply judgment
- The combined system H+A₀ is more capable than either alone
- Crucially, H's judgment keeps the combined system aligned
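The amplification step above can be sketched as a tiny toy program. All of the function names and interfaces here are hypothetical stand-ins, not Christiano's implementation: a human policy H decomposes the question, queries the weak assistant A₀ on subproblems, and integrates the results with its own judgment.

```python
def assistant_a0(subquestion: str) -> str:
    """Stand-in for the weak but aligned model A0."""
    return f"A0's answer to: {subquestion}"

def human_decompose(question: str) -> list[str]:
    """Stand-in for H breaking a hard question into easier subquestions."""
    return [f"{question} (part {i})" for i in range(1, 4)]

def human_integrate(question: str, subanswers: list[str]) -> str:
    """Stand-in for H combining subanswers with its own judgment."""
    return f"Answer to '{question}' built from {len(subanswers)} subanswers"

def amplify(question: str) -> str:
    """The combined system H + A0: more capable than either alone."""
    subquestions = human_decompose(question)
    subanswers = [assistant_a0(q) for q in subquestions]
    return human_integrate(question, subanswers)

print(amplify("hard question"))
```

The point of the sketch is structural: capability comes from decomposition plus integration, while alignment comes from H sitting at both the decomposition and integration points.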
Distillation
Train a new AI system (A₁) to match the behavior of the H+A₀ combination:
- A₁ learns to produce the same outputs as the human-AI team
- But A₁ runs efficiently (no human in the loop at inference time)
- The distillation step is where alignment can degrade — A₁ approximates H+A₀ but may not perfectly preserve alignment properties
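Distillation can likewise be sketched as ordinary supervised imitation. This is a framework-free illustration under simplifying assumptions (the "amplified system" is a known linear function, and "training A₁" is a least-squares fit); real distillation trains a neural model on the amplified system's outputs.

```python
def amplified_system(x: float) -> float:
    """Stand-in for H + A0's behavior on input x."""
    return 2.0 * x + 1.0

def fit_linear(xs, ys):
    """Least-squares fit of y = a*x + b, standing in for training A1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [amplified_system(x) for x in xs]   # supervision from H + A0
a, b = fit_linear(xs, ys)                # "A1" imitates H + A0
print(f"A1(x) = {a:.2f}*x + {b:.2f}")    # runs without the human in the loop
```

In this toy case the fit is exact; in practice A₁ only approximates H+A₀, which is exactly where the alignment degradation described above enters.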
Iteration
Repeat: use H+A₁ to solve even harder problems, then distill into A₂. Each cycle:
- Capability increases (the amplified system handles harder problems)
- Alignment is maintained by the human's judgment at each amplification step
- The alignment guarantee degrades slightly at each distillation step
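The full loop can be caricatured in a few lines. This is a minimal sketch with entirely hypothetical numbers: each "model" is reduced to a capability score, amplification adds a fixed human contribution, and distillation copies the amplified system with a small fidelity loss.

```python
HUMAN_BOOST = 2.0        # capability added by H's decomposition and judgment
DISTILL_FIDELITY = 0.98  # fraction of amplified behavior preserved per step

def amplify(model_capability: float) -> float:
    """H + A_n: the human-AI team outperforms either alone."""
    return model_capability + HUMAN_BOOST

def distill(amplified_capability: float) -> float:
    """Train A_{n+1} to imitate H + A_n; the approximation loses a little."""
    return amplified_capability * DISTILL_FIDELITY

capability = 1.0  # A0
for step in range(5):
    capability = distill(amplify(capability))  # produces A1, A2, ...
    print(f"A{step + 1} capability: {capability:.3f}")
```

Even in this caricature the two trends from the bullets above are visible: capability climbs each cycle, while the distillation multiplier quietly taxes every iteration.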
The Alignment Guarantee
IDA provides alignment under two conditions:
- The amplification step preserves alignment: if A_n is aligned and H is a competent judge, then H+A_n is aligned
- The distillation step approximately preserves behavior: if the training process faithfully copies the amplified system's behavior, then A_{n+1} inherits the alignment of H+A_n up to approximation error
The guarantee is probabilistic, not absolute: each distillation step introduces some error, and these errors compound. Over many iterations, the accumulated drift could be significant.
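A back-of-envelope calculation makes the compounding concrete. Assuming (purely for illustration) that each distillation step independently preserves alignment with probability p, the chance alignment survives n iterations is p**n:

```python
p = 0.99  # per-step preservation probability (illustrative assumption)
for n in (10, 100, 500):
    # probability that alignment survives n independent distillation steps
    print(f"after {n} steps: {p**n:.3f}")
```

Even a 99%-reliable step leaves roughly a one-in-three survival rate after 100 iterations, which is why the accumulated drift matters.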
Why IDA Matters
- No training on the hardest problems: The human never needs to evaluate superhuman outputs directly. They only evaluate subproblems at a level they can understand.
- Recursive decomposition: Complex problems are broken into simpler ones, each human-verifiable.
- Structurally collective: At every iteration, the system is fundamentally a human-AI team, not an autonomous agent.
- Connects to debate: The amplification step can use debate (AI Safety via Debate) as its oversight mechanism.
Challenges
- Compounding distillation errors: The central vulnerability. Each distillation step is approximate.
- Task decomposability: Not all problems decompose into human-evaluable subproblems.
- Speed: The amplification step requires human involvement, limiting throughput.
- Human reliability: The alignment guarantee rests on the human's judgment being sound.
Related Work
The 2018 paper "Supervising strong learners by amplifying weak experts" (Christiano et al., arXiv:1810.08575) provides the formal framework. The key theoretical result: if the weak expert satisfies certain alignment properties, and distillation is faithful enough, the resulting system satisfies the same properties at a higher capability level.
Significance for Teleo KB
IDA is structurally the closest published mechanism to what our collective agent architecture does: human judgment at every step, recursive capability amplification, and distillation into efficient agents. The key difference: our architecture uses multiple specialized agents rather than a single distilled model, which may be more robust to compounding distillation errors because specialization reduces the scope of each distillation target.