---
type: source
title: "AI Safety via Debate"
author: "Geoffrey Irving, Paul Christiano, Dario Amodei"
url: https://arxiv.org/abs/1805.00899
date: 2018-05-02
domain: ai-alignment
intake_tier: research-task
rationale: "Foundational scalable oversight mechanism. Theoretical basis for debate-as-alignment — polynomial-time judges can verify PSPACE claims through adversarial debate. Phase 2 alignment research program."
proposed_by: Theseus
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-05
claims_extracted:
- "verification is easier than generation up to a capability-dependent ceiling: debate and recursive reward modeling enable polynomial-time human judges to verify claims that would require exponentially more computation to generate from scratch, but this asymmetry degrades as AI capability outpaces human ability to evaluate arguments"
enrichments:
- "scalable oversight degrades predictably as the capability gap between AI systems and human evaluators widens, because evaluation accuracy depends on the evaluator's ability to understand the solution space, which shrinks relative to the system's capability frontier"
tags: [alignment, debate, scalable-oversight, PSPACE, verification, adversarial]
---
# AI Safety via Debate
Published as an arXiv preprint in May 2018 by Geoffrey Irving, Paul Christiano, and Dario Amodei. This paper proposes training AI systems through adversarial debate as a scalable oversight mechanism.
## Core Mechanism
Two AI agents alternate making arguments in response to a question, constrained by length limits. A human judge evaluates which agent provided more truthful and useful information. The key insight: **adversarial dynamics incentivize honesty** because any deceptive argument can be exposed by the opposing agent.
The training procedure:
1. Two agents play a zero-sum debate game
2. Each agent tries to convince a human judge
3. The agent judged more truthful receives the reward
4. Both agents are trained via self-play
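The four steps above can be sketched as a minimal zero-sum game loop. This is a hypothetical illustration, not the paper's implementation: the agents, judge, `rounds` parameter, and reward values are all stand-ins.

```python
def debate_game(agent_a, agent_b, question, judge, rounds=3):
    """Play one debate episode (toy sketch).

    agent_a, agent_b: callables (question, transcript) -> argument string.
    judge: callable (question, transcript) -> "A" or "B".
    Returns a zero-sum reward pair (reward_a, reward_b).
    """
    transcript = []
    for _ in range(rounds):
        # Agents alternate and see the full transcript so far,
        # so a deceptive argument can be rebutted on the next turn.
        transcript.append(("A", agent_a(question, transcript)))
        transcript.append(("B", agent_b(question, transcript)))
    winner = judge(question, transcript)
    # Step 3: the agent judged more truthful gets +1, the other -1.
    return (1, -1) if winner == "A" else (-1, 1)
```

In self-play training (step 4), the same policy would play both sides and be updated toward the equilibrium of this game.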
## Theoretical Foundation
The paper's strongest result connects debate to computational complexity theory: **debate with optimal play can answer any question in PSPACE given polynomial-time judges**. This means:
- A polynomial-time human judge can verify claims that would require exponentially more computation to evaluate independently
- The adversarial structure forces honesty because lies can be profitably exposed
- This provides a theoretical basis for scalable oversight — human verification doesn't need to match AI capability
The complexity-theoretic argument: games between alternating polynomial-time players can decide exactly the problems in PSPACE, just as interactive proof systems let a polynomial-time verifier check claims whose direct evaluation is intractable. Debate implements this game structure for natural language claims about AI behavior.
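A toy model of this asymmetry (an illustrative sketch, not from the paper): take a deep AND/OR game tree. Fully evaluating it touches exponentially many leaves, but if two optimal debaters alternately pick branches, with the proponent of "True" choosing at OR nodes and the opponent at AND nodes, the path ends at a single leaf whose value equals the whole tree's value. The judge therefore checks only one leaf.

```python
def eval_tree(node):
    """Ground truth: full evaluation touches every leaf
    (exponential in tree depth)."""
    if isinstance(node, bool):
        return node
    op, left, right = node
    if op == "AND":
        return eval_tree(left) and eval_tree(right)
    return eval_tree(left) or eval_tree(right)

def debated_leaf(node):
    """Optimal-play debate: descend a single root-to-leaf path.
    The proponent (claiming True) moves at OR nodes; the opponent
    moves at AND nodes. The judge inspects only the final leaf,
    which costs O(depth)."""
    while not isinstance(node, bool):
        op, left, right = node
        if op == "OR":
            # Proponent steers to a true child if one exists.
            node = left if eval_tree(left) else right
        else:
            # Opponent steers to a false child if one exists.
            node = left if not eval_tree(left) else right
    return node
```

Note that the *players* here still compute optimally (and may be expensive); the point is that the *judge's* work stays cheap, which is the verification-is-easier-than-generation asymmetry.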
## Empirical Results
Testing on MNIST classification (a proof of concept):
- Competing agents select pixels to reveal to a judge
- Accuracy improved from 59.4% to 88.9% using 6 pixels
- Accuracy improved from 48.2% to 85.2% using 4 pixels
- Adversarial selection dramatically outperformed random pixel selection
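A self-contained toy version of this selection dynamic (hypothetical code, not the authors' experiment): the "judge" is a nearest-class-mean classifier restricted to the revealed pixels, and a debater greedily reveals whichever unrevealed pixel most improves its claimed class's margin under that judge.

```python
def judge(revealed, class_means):
    """Classify from revealed pixels only: nearest class mean
    on the revealed coordinates. revealed: {pixel_index: value}."""
    def dist(mean):
        return sum((mean[i] - v) ** 2 for i, v in revealed.items())
    return min(class_means, key=lambda c: dist(class_means[c]))

def reveal_best_pixel(image, revealed, class_means, claimed):
    """Greedy debater move: reveal the unrevealed pixel that maximizes
    the claimed class's margin over its best rival under the judge's rule."""
    def margin(trial):
        d = {c: sum((m[i] - v) ** 2 for i, v in trial.items())
             for c, m in class_means.items()}
        rival = min(d[c] for c in d if c != claimed)
        return rival - d[claimed]  # larger = better for the claimed class
    best = None
    for i, v in enumerate(image):
        if i in revealed:
            continue
        trial = dict(revealed)
        trial[i] = v
        if best is None or margin(trial) > best[1]:
            best = (i, margin(trial))
    revealed[best[0]] = image[best[0]]
    return revealed
```

Random selection would instead reveal arbitrary pixels; the paper's finding is that this kind of targeted, adversarial selection lets a sparse judge perform far better than random reveals do.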
## Limitations and Open Questions
1. **Human judge quality**: The theoretical guarantee assumes an honest, competent judge. Real humans have cognitive biases that debaters could exploit.
2. **Argument complexity**: Some truths may require long chains of reasoning that exceed human attention span.
3. **Collusion**: Both agents might converge on the same deceptive response if it's the equilibrium of the debate game.
4. **Scalability**: The MNIST results are encouraging but the gap from toy tasks to real alignment is enormous.
## Significance
This paper is the theoretical basis for the "scalable oversight" research agenda. It was co-authored by the future heads of two leading alignment organizations (Christiano → ARC, Amodei → Anthropic), and its ideas directly influenced Constitutional AI, debate-style variants of RLHF, and recursive reward modeling.
The key tension: the PSPACE theoretical guarantee is powerful but assumes optimal play. In practice, empirical results show scalable oversight degrades as the capability gap widens (the 50% accuracy finding at moderate gaps from the 2025 scaling laws paper). This gap between theory and practice is one of the central tensions in the KB.