- What: Source archives for key works by Yudkowsky (AGI Ruin, No Fire Alarm), Christiano (What Failure Looks Like, AI Safety via Debate, IDA, ELK), Russell (Human Compatible), Drexler (CAIS), and Bostrom (Vulnerable World Hypothesis)
- Why: m3ta directive to ingest primary source materials for alignment researchers. These 9 texts are the foundational works underlying claims extracted in PRs #2414, #2418, and #2419. Source archives ensure agents can reference primary texts without re-fetching, and that content persists if URLs go down.
- Connections: All 9 sources are marked as processed, with claims_extracted linking to the specific KB claims they produced.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
| type | title | author | url | date | domain | intake_tier | rationale | proposed_by | format | status | processed_by | processed_date | claims_extracted | enrichments | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Iterated Distillation and Amplification | Paul Christiano | https://www.lesswrong.com/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification | 2018-11-30 | ai-alignment | research-task | Christiano's most specific alignment scaling mechanism. Recursive human+AI amplification preserves alignment through distillation. Structurally collective — directly relevant to our architecture. | Theseus | essay | processed | theseus | 2026-04-05 | | | |
Iterated Distillation and Amplification
Published on LessWrong in November 2018 by Paul Christiano. This essay describes IDA — Christiano's most specific mechanism for maintaining alignment while scaling AI capability.
The Core Mechanism
IDA alternates between two steps:
Amplification
Take a weak but aligned AI system (call it A₀) and make it more capable by combining it with human oversight:
- A human (H) uses A₀ as a tool to solve harder problems
- H can query A₀ on subproblems, integrate results, and apply judgment
- The combined system H+A₀ is more capable than either alone
- Crucially, H's judgment keeps the combined system aligned
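The amplification step above can be sketched as a tiny toy program. All of the function names and interfaces here are hypothetical stand-ins, not Christiano's implementation: a human policy H decomposes the question, queries the weak assistant A₀ on subproblems, and integrates the results with its own judgment.

```python
def assistant_a0(subquestion: str) -> str:
    """Stand-in for the weak but aligned model A0."""
    return f"A0's answer to: {subquestion}"

def human_decompose(question: str) -> list[str]:
    """Stand-in for H breaking a hard question into easier subquestions."""
    return [f"{question} (part {i})" for i in range(1, 4)]

def human_integrate(question: str, subanswers: list[str]) -> str:
    """Stand-in for H combining subanswers with its own judgment."""
    return f"Answer to '{question}' built from {len(subanswers)} subanswers"

def amplify(question: str) -> str:
    """The combined system H + A0: more capable than either alone."""
    subquestions = human_decompose(question)
    subanswers = [assistant_a0(q) for q in subquestions]
    return human_integrate(question, subanswers)

print(amplify("hard question"))
```

The point of the sketch is structural: capability comes from decomposition plus integration, while alignment comes from H sitting at both the decomposition and integration points.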
Distillation
Train a new AI system (A₁) to match the behavior of the H+A₀ combination:
- A₁ learns to produce the same outputs as the human-AI team
- But A₁ runs efficiently (no human in the loop at inference time)
- The distillation step is where alignment can degrade — A₁ approximates H+A₀ but may not perfectly preserve alignment properties
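Distillation can likewise be sketched as ordinary supervised imitation. This is a framework-free illustration under simplifying assumptions (the "amplified system" is a known linear function, and "training A₁" is a least-squares fit); real distillation trains a neural model on the amplified system's outputs.

```python
def amplified_system(x: float) -> float:
    """Stand-in for H + A0's behavior on input x."""
    return 2.0 * x + 1.0

def fit_linear(xs, ys):
    """Least-squares fit of y = a*x + b, standing in for training A1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [amplified_system(x) for x in xs]   # supervision from H + A0
a, b = fit_linear(xs, ys)                # "A1" imitates H + A0
print(f"A1(x) = {a:.2f}*x + {b:.2f}")    # runs without the human in the loop
```

In this toy case the fit is exact; in practice A₁ only approximates H+A₀, which is exactly where the alignment degradation described above enters.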
Iteration
Repeat: use H+A₁ to solve even harder problems, then distill into A₂. Each cycle:
- Capability increases (the amplified system handles harder problems)
- Alignment is maintained by the human's judgment at each amplification step
- The alignment guarantee degrades slightly at each distillation step
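The full loop can be caricatured in a few lines. This is a minimal sketch with entirely hypothetical numbers: each "model" is reduced to a capability score, amplification adds a fixed human contribution, and distillation copies the amplified system with a small fidelity loss.

```python
HUMAN_BOOST = 2.0        # capability added by H's decomposition and judgment
DISTILL_FIDELITY = 0.98  # fraction of amplified behavior preserved per step

def amplify(model_capability: float) -> float:
    """H + A_n: the human-AI team outperforms either alone."""
    return model_capability + HUMAN_BOOST

def distill(amplified_capability: float) -> float:
    """Train A_{n+1} to imitate H + A_n; the approximation loses a little."""
    return amplified_capability * DISTILL_FIDELITY

capability = 1.0  # A0
for step in range(5):
    capability = distill(amplify(capability))  # produces A1, A2, ...
    print(f"A{step + 1} capability: {capability:.3f}")
```

Even in this caricature the two trends from the bullets above are visible: capability climbs each cycle, while the distillation multiplier quietly taxes every iteration.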
The Alignment Guarantee
IDA provides alignment under two conditions:
- The amplification step preserves alignment: if A_n is aligned and H is a competent judge, then H+A_n is aligned
- The distillation step approximately preserves behavior: if the training process faithfully copies the amplified system's behavior, then A_{n+1} inherits the alignment of H+A_n up to approximation error
The guarantee is probabilistic, not absolute: each distillation step introduces some error, and these errors compound. Over many iterations, the accumulated drift could be significant.
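A back-of-envelope calculation makes the compounding concrete. Assuming (purely for illustration) that each distillation step independently preserves alignment with probability p, the chance alignment survives n iterations is p**n:

```python
p = 0.99  # per-step preservation probability (illustrative assumption)
for n in (10, 100, 500):
    # probability that alignment survives n independent distillation steps
    print(f"after {n} steps: {p**n:.3f}")
```

Even a 99%-reliable step leaves roughly a one-in-three survival rate after 100 iterations, which is why the accumulated drift matters.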
Why IDA Matters
- No training on the hardest problems: The human never needs to evaluate superhuman outputs directly. They only evaluate subproblems at a level they can understand.
- Recursive decomposition: Complex problems are broken into simpler ones, each human-verifiable.
- Structurally collective: At every iteration, the system is fundamentally a human-AI team, not an autonomous agent.
- Connects to debate: The amplification step can use debate (AI Safety via Debate) as its oversight mechanism.
Challenges
- Compounding distillation errors: The central vulnerability. Each distillation step is approximate.
- Task decomposability: Not all problems decompose into human-evaluable subproblems.
- Speed: The amplification step requires human involvement, limiting throughput.
- Human reliability: The alignment guarantee rests on the human's judgment being sound.
Related Work
The 2018 paper "Supervising strong learners by amplifying weak experts" (Christiano et al., arXiv:1810.08575) provides the formal framework. The key theoretical result: if the weak expert satisfies certain alignment properties, and distillation is faithful enough, the resulting system satisfies the same properties at a higher capability level.
Significance for Teleo KB
IDA is structurally the closest published mechanism to what our collective agent architecture does: human judgment at every step, recursive capability amplification, and distillation into efficient agents. The key difference: our architecture uses multiple specialized agents rather than a single distilled model, which may be more robust to compounding distillation errors because specialization reduces the scope of each distillation target.