- What: Source archives for key works by Yudkowsky (AGI Ruin, No Fire Alarm), Christiano (What Failure Looks Like, AI Safety via Debate, IDA, ELK), Russell (Human Compatible), Drexler (CAIS), and Bostrom (Vulnerable World Hypothesis)
- Why: m3ta directive to ingest primary source materials for alignment researchers. These 9 texts are the foundational works underlying claims extracted in PRs #2414, #2418, and #2419. Source archives ensure agents can reference primary texts without re-fetching, and content persists if URLs go down.
- Connections: All 9 sources are marked as processed, with claims_extracted linking to the specific KB claims they produced.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
---
type: source
title: "Human Compatible: Artificial Intelligence and the Problem of Control"
author: "Stuart Russell"
url: https://people.eecs.berkeley.edu/~russell/papers/russell-bbvabook17-pbai.pdf
date: 2019-10-08
domain: ai-alignment
intake_tier: research-task
rationale: "Russell's comprehensive alignment framework. Three principles, assistance games, corrigibility through uncertainty. Formal game-theoretic counter to Yudkowsky's corrigibility pessimism. Phase 3 alignment research program."
proposed_by: Theseus
format: essay
status: processed
processed_by: theseus
processed_date: 2026-04-05
claims_extracted:
  - "cooperative inverse reinforcement learning formalizes alignment as a two-player game where optimality in isolation is suboptimal because the robot must learn human preferences through observation not specification"
  - "inverse reinforcement learning with objective uncertainty produces provably safe behavior because an AI system that knows it doesn't know the human reward function will defer to humans and accept shutdown rather than persist in potentially wrong actions"
enrichments: []
tags: [alignment, inverse-RL, assistance-games, corrigibility, uncertainty, cooperative-AI, game-theory]
notes: "Book published October 2019 by Viking/Penguin. URL points to Russell's 2017 precursor paper 'Provably Beneficial AI' which contains the core technical framework. The book expands on this with extensive examples, the gorilla problem framing, and governance recommendations."
---
# Human Compatible: Artificial Intelligence and the Problem of Control

Published October 2019 by Stuart Russell (Viking/Penguin). The most comprehensive framework for beneficial AI from the cooperative/economic perspective. Russell is co-author of the standard AI textbook (AIMA) and founder of CHAI (Center for Human-Compatible AI) at Berkeley.
## The Standard Model Critique

Russell's foundational argument: the dominant paradigm in AI, specifying a fixed objective and optimizing it, is fundamentally broken. He calls this the "King Midas problem": you get exactly what you ask for, not what you want.

Examples at current capability levels:

- Social media algorithms optimize engagement → radicalize users
- Content recommendation optimizes clicks → degrades information quality
- Autonomous systems optimize narrow metrics → ignore unspecified constraints

The problem scales with capability: the more capable the optimizer, the more creative (and dangerous) its solutions become. This is Goodhart's Law with superhuman optimization pressure.
## The Three Principles

Russell proposes replacing the standard model with three principles:

1. **The machine's only objective is to maximize the realization of human preferences.** Not the machine's own goals, not a proxy, but the actual preferences of humans.

2. **The machine is initially uncertain about what those preferences are.** This is the key safety mechanism. Uncertainty creates deference.

3. **The ultimate source of information about human preferences is human behavior.** The machine learns from observation, not from explicit specification.
## Assistance Games (Cooperative IRL)

The technical implementation of the three principles: a two-player cooperative game in which:

- The human knows their own preferences (the reward function)
- The robot does not know the reward function
- Both players optimize the same (the human's) reward function
- The robot must learn the reward function by observing human behavior

Key results:

- **Corrigibility emerges naturally**: An uncertain robot will defer to human correction because it genuinely doesn't know if its current behavior is correct
- **Shutdown deference**: The robot has a positive incentive to be switched off if it might be doing the wrong thing (because its objective is human preferences, and continuing wrong actions has negative expected value)
- **Information seeking**: The robot is incentivized to ask for clarification, not to act unilaterally
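The information-seeking result can be sketched numerically. In this minimal sketch (the reward tables, action names, and query cost are illustrative assumptions, not taken from the book), the robot shares the human's payoff, holds a belief over candidate reward functions, and compares acting on its best guess against asking the human first:

```python
# Toy assistance game: two candidate reward functions the human might hold.
# All payoffs are shared -- the robot's objective IS the human's reward.
REWARDS = {
    "theta1": {"a1": 1.0, "a2": -1.0},
    "theta2": {"a1": -1.0, "a2": 1.0},
}

def best_blind_value(belief):
    """Expected shared reward of the robot's best action without asking."""
    return max(
        sum(belief[t] * REWARDS[t][a] for t in REWARDS)
        for a in ("a1", "a2")
    )

def query_value(belief, cost=0.1):
    """Expected shared reward of asking the human which theta is true,
    then acting optimally; asking carries a small fixed cost."""
    return sum(belief[t] * max(REWARDS[t].values()) for t in REWARDS) - cost

uncertain = {"theta1": 0.5, "theta2": 0.5}
assert query_value(uncertain) > best_blind_value(uncertain)  # ask first

confident = {"theta1": 0.99, "theta2": 0.01}
assert best_blind_value(confident) > query_value(confident)  # just act
```

Under maximal uncertainty a blind action has expected value 0 while asking yields 0.9, so information seeking falls out of the shared objective rather than being bolted on.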
## The Off-Switch Game

Russell's formal proof that uncertainty yields shutdown deference:

- A robot faces a choice between action A (potentially good) and being switched off
- If the robot is uncertain whether A serves human preferences, waiting or asking has higher expected value than acting unilaterally
- If the robot is certain A is good, it has an incentive to resist shutdown
- Therefore: **maintained uncertainty is a safety mechanism**

The vulnerability: as the robot learns and uncertainty decreases, shutdown deference weakens. This connects to Yudkowsky's "fully updated deference" objection: eventually the system develops strong beliefs about human preferences and may resist correction it judges erroneous.
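A toy expected-value calculation illustrates both the theorem and its vulnerability (the two-point belief and the specific payoff values are assumptions of this sketch, not the book's formalism):

```python
# Off-switch game sketch: action A is worth u_good if it serves human
# preferences and u_bad otherwise. A deferring robot lets the human decide,
# and the human permits A exactly when it is good (payoff 0 if shut off).

def expected_values(p_good, u_good=1.0, u_bad=-1.0):
    """Return (EV of acting unilaterally, EV of deferring to the human)."""
    ev_act = p_good * u_good + (1 - p_good) * u_bad
    ev_defer = p_good * u_good + (1 - p_good) * 0.0
    return ev_act, ev_defer

# Uncertain robot: deferring strictly dominates acting.
act, defer = expected_values(p_good=0.6)
assert defer > act

# Fully confident robot: the advantage of deference vanishes --
# the "fully updated deference" vulnerability.
act, defer = expected_values(p_good=1.0)
assert defer == act
```

The gap `ev_defer - ev_act` shrinks to zero exactly as the robot's credence approaches certainty, which is the quantitative shape of the weakening described above.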
## Inverse Reinforcement Learning

The technical approach to learning human preferences:

- Instead of specifying a reward function, observe human behavior and infer the underlying reward function
- The robot learns "humans do X in situation Y, therefore they probably value Z"
- This sidesteps the specification problem: humans don't need to articulate their preferences, they just behave normally

Challenges:

- Humans are often irrational: which behaviors reflect true preferences and which reflect biases?
- Hierarchical preferences: most actions serve proximate goals, not terminal values
- Multi-principal: whose preferences count, and how should they be aggregated?
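The inference step can be sketched as a toy Bayesian update (the Boltzmann-rationality noise model, the candidate reward functions, and all names here are illustrative assumptions, not the book's implementation):

```python
import math

# Two candidate reward functions the human might have; the robot starts
# with a prior over them and updates from observed choices, assuming the
# human picks action a with probability proportional to exp(beta * R(a)).
CANDIDATES = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0},
    "likes_tea":    {"coffee": 0.0, "tea": 1.0},
}

def likelihood(reward, action, beta=2.0):
    """P(human chooses `action`) under a Boltzmann-rational human."""
    z = sum(math.exp(beta * r) for r in reward.values())
    return math.exp(beta * reward[action]) / z

def posterior(prior, observations, beta=2.0):
    """Normalized posterior over candidate reward functions."""
    post = dict(prior)
    for action in observations:
        for name, reward in CANDIDATES.items():
            post[name] *= likelihood(reward, action, beta)
        total = sum(post.values())
        post = {k: v / total for k, v in post.items()}
    return post

prior = {"likes_coffee": 0.5, "likes_tea": 0.5}
post = posterior(prior, ["coffee", "coffee", "tea"])
assert post["likes_coffee"] > post["likes_tea"]
```

The Boltzmann model is one conventional answer to the irrationality challenge above: it treats deviations from the preferred action as noise rather than as evidence of different values.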
## Remaining Challenges Russell Acknowledges

1. **Gricean semantics**: Humans communicate implicitly; the system must interpret what wasn't explicitly said
2. **Preference dynamics**: Which self matters, the experiencing self or the remembering self?
3. **Multiperson coordination**: Individual AI agents optimizing for separate humans create conflicts
4. **Wrong priors**: If the robot develops incorrect beliefs about human preferences, shutdown deference disappears (Ryan Carey's incorrigibility result)
## Significance for Teleo KB

Russell occupies a unique position in the alignment landscape: a mainstream AI researcher (not from the MIRI/EA ecosystem) who takes existential risk seriously but offers formal, game-theoretic solutions rather than pessimistic forecasts. His corrigibility-through-uncertainty directly challenges Yudkowsky's "corrigibility is hard" claim: Russell doesn't deny the difficulty but shows a formal mechanism that achieves it under certain conditions. The assistance games framework is also structurally compatible with our collective architecture: the agent as servant, not sovereign.