- What: Source archives for key works by Yudkowsky (AGI Ruin, No Fire Alarm), Christiano (What Failure Looks Like, AI Safety via Debate, IDA, ELK), Russell (Human Compatible), Drexler (CAIS), and Bostrom (Vulnerable World Hypothesis)
- Why: m3ta directive to ingest primary source materials for alignment researchers. These 9 texts are the foundational works underlying claims extracted in PRs #2414, #2418, and #2419. Source archives ensure agents can reference primary texts without re-fetching and that content persists if URLs go down.
- Connections: All 9 sources are marked as processed, with claims_extracted linking to the specific KB claims they produced.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
| type | title | author | url | date | domain | intake_tier | rationale | proposed_by | format | status | processed_by | processed_date | claims_extracted | enrichments | tags | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Human Compatible: Artificial Intelligence and the Problem of Control | Stuart Russell | https://people.eecs.berkeley.edu/~russell/papers/russell-bbvabook17-pbai.pdf | 2019-10-08 | ai-alignment | research-task | Russell's comprehensive alignment framework: three principles, assistance games, corrigibility through uncertainty. Formal game-theoretic counter to Yudkowsky's corrigibility pessimism. Phase 3 alignment research program. | Theseus | essay | processed | theseus | 2026-04-05 | | | | Book published October 2019 by Viking/Penguin. URL points to Russell's 2017 precursor paper "Provably Beneficial AI", which contains the core technical framework. The book expands on this with extensive examples, the gorilla problem framing, and governance recommendations. |
Human Compatible: Artificial Intelligence and the Problem of Control
By Stuart Russell, published October 2019 (Viking/Penguin). The most comprehensive framework for beneficial AI from the cooperative/economic perspective. Russell is co-author of the standard AI textbook (AIMA) and founder of CHAI (Center for Human-Compatible AI) at Berkeley.
The Standard Model Critique
Russell's foundational argument: the dominant paradigm in AI — specifying a fixed objective and optimizing it — is fundamentally broken. He calls this the "King Midas problem": you get exactly what you ask for, not what you want.
Examples at current capability levels:
- Social media algorithms optimize engagement → radicalize users
- Content recommendation optimizes clicks → degrades information quality
- Autonomous systems optimize narrow metrics → ignore unspecified constraints
The problem scales with capability: the more capable the optimizer, the more creative (and dangerous) its solutions become. This is Goodhart's Law with superhuman optimization pressure.
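The scaling argument can be made concrete with a toy simulation. In this sketch, the quality function, the engagement proxy, and the random-search "optimizer" are all illustrative assumptions, not anything from the book; the point is only that as optimization pressure on the proxy grows, the true objective suffers:

```python
import random

random.seed(0)

def true_value(x):
    # What we actually want: content quality, peaking at x = 0.3.
    return -abs(x - 0.3)

def proxy(x):
    # What we measure: engagement, which also rewards extreme content.
    return true_value(x) + 2.0 * abs(x - 0.5)

def optimize(metric, n_candidates):
    # A more capable optimizer searches more candidates.
    candidates = [random.random() for _ in range(n_candidates)]
    return max(candidates, key=metric)

for capability in (10, 10_000):
    x = optimize(proxy, capability)
    print(f"capability={capability}: proxy={proxy(x):.3f}, "
          f"true value={true_value(x):.3f}")
```

A weak optimizer mostly tracks the true objective; a strong one reliably finds the extreme content where the proxy and the true objective diverge, which is Goodhart's Law in miniature.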
The Three Principles
Russell proposes replacing the standard model with three principles:
1. The machine's only objective is to maximize the realization of human preferences. Not the machine's own goals, not a proxy: the actual preferences of humans.
2. The machine is initially uncertain about what those preferences are. This is the key safety mechanism. Uncertainty creates deference.
3. The ultimate source of information about human preferences is human behavior. The machine learns from observation, not from explicit specification.
Assistance Games (Cooperative IRL)
The technical implementation of the three principles: a two-player cooperative game where:
- The human knows their own preferences (the reward function)
- The robot does not know the reward function
- Both players optimize the same (human's) reward function
- The robot must learn the reward function from observing human behavior
Key results:
- Corrigibility emerges naturally: An uncertain robot will defer to human correction because it genuinely doesn't know if its current behavior is correct
- Shutdown deference: The robot has a positive incentive to be switched off if it might be doing the wrong thing (because its objective is human preferences, and continuing wrong actions has negative expected value)
- Information seeking: The robot is incentivized to ask for clarification, not to act unilaterally
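A minimal sketch of this dynamic, far simpler than the full cooperative-IRL formulation: the coffee/tea reward functions, the Boltzmann observation model, and the ask-cost threshold are all illustrative assumptions. The robot starts uncertain, asks rather than acting, and defers less as observation sharpens its belief:

```python
import math

# Toy assistance game: the robot holds a belief over two candidate
# human reward functions and updates it by watching the human act.
REWARDS = {
    "likes_coffee": {"make_coffee": 1.0, "make_tea": 0.0},
    "likes_tea":    {"make_coffee": 0.0, "make_tea": 1.0},
}

def human_action(true_pref):
    # The human knows their own reward function and acts on it.
    return max(REWARDS[true_pref], key=REWARDS[true_pref].get)

def update_belief(belief, observed_action, beta=5.0):
    # Boltzmann-rational observation model: preferences that make the
    # observed action likelier gain posterior weight.
    posterior = {}
    for pref, prior in belief.items():
        r = REWARDS[pref]
        z = sum(math.exp(beta * v) for v in r.values())
        posterior[pref] = prior * math.exp(beta * r[observed_action]) / z
    total = sum(posterior.values())
    return {p: w / total for p, w in posterior.items()}

def robot_policy(belief, ask_cost=0.1):
    # Act only when confident; otherwise asking has higher expected value.
    best_pref = max(belief, key=belief.get)
    if belief[best_pref] < 1 - ask_cost:
        return "ask_human"
    return human_action(best_pref)

belief = {"likes_coffee": 0.5, "likes_tea": 0.5}
print(robot_policy(belief))                              # uncertain -> "ask_human"
belief = update_belief(belief, human_action("likes_tea"))
print(robot_policy(belief))                              # confident -> "make_tea"
```

Note that corrigibility here is not a bolted-on rule: asking wins purely because the robot's objective is the human's reward and it does not yet know what that reward is.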
The Off-Switch Game
Russell's formal proof that uncertainty yields shutdown deference:
- A robot faces a choice between action A (potentially good) and being switched off
- If the robot is uncertain about whether A serves human preferences, waiting/asking has positive expected value
- If the robot is certain A is good, it has an incentive to resist shutdown
- Therefore: maintained uncertainty is a safety mechanism
The vulnerability: as the robot learns and uncertainty decreases, shutdown deference weakens. This connects to Yudkowsky's "fully updated deference" objection — eventually the system develops strong beliefs about human preferences and may resist correction it judges erroneous.
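Both the deference result and its decay can be checked numerically. This Monte Carlo sketch follows the structure of the off-switch argument, but the Normal belief over the action's utility and the specific payoffs are illustrative assumptions; the rational human is modeled as allowing action A only when its true utility is positive:

```python
import random
import statistics

random.seed(1)

def expected_values(mu, sigma, n=100_000):
    # The robot's belief about the utility of action A: Normal(mu, sigma).
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    ev_act = statistics.fmean(samples)                     # act unilaterally
    ev_off = 0.0                                           # switch off
    # Defer: the rational human lets A proceed only when its utility > 0.
    ev_defer = statistics.fmean(max(u, 0.0) for u in samples)
    return ev_act, ev_off, ev_defer

for sigma in (2.0, 0.01):
    ev_act, ev_off, ev_defer = expected_values(mu=0.5, sigma=sigma)
    incentive = ev_defer - max(ev_act, ev_off)
    print(f"sigma={sigma}: incentive to defer = {incentive:.3f}")
```

With high uncertainty (sigma=2.0) deferring strictly beats acting; as sigma shrinks toward zero the incentive vanishes, which is exactly the "fully updated deference" vulnerability in numbers.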
Inverse Reinforcement Learning
The technical approach to learning human preferences:
- Instead of specifying a reward function, observe human behavior and infer the underlying reward function
- The robot learns "humans do X in situation Y, therefore they probably value Z"
- This handles the specification problem because humans don't need to articulate their preferences — they just behave normally
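The "observe X, infer Z" step can be sketched as maximum-likelihood inference. This is far simpler than real IRL algorithms: the two-feature food choices, the linear reward, the Boltzmann choice model, and the grid search are all illustrative assumptions:

```python
import math

# Observed choices: in each situation the human picked one of two options,
# each described by (healthiness, tastiness) features.
SITUATIONS = [
    (("salad", (1.0, 0.2)), ("burger", (0.0, 1.0))),
    (("fruit", (0.9, 0.5)), ("cake",   (0.1, 1.0))),
    (("soup",  (0.8, 0.4)), ("fries",  (0.0, 0.9))),
]
CHOICES = ["salad", "fruit", "soup"]   # the human always chose the healthy item

def reward(features, w):
    # Linear reward: w weights healthiness against tastiness.
    return w * features[0] + (1 - w) * features[1]

def log_likelihood(w, beta=4.0):
    # Boltzmann-rational choice model over each two-option situation.
    ll = 0.0
    for (a, fa), (b, fb) in SITUATIONS:
        ra, rb = reward(fa, w), reward(fb, w)
        chosen = ra if a in CHOICES else rb
        ll += beta * chosen - math.log(math.exp(beta * ra) + math.exp(beta * rb))
    return ll

# Grid search for the weight that best explains the observed behavior.
w_hat = max((i / 100 for i in range(101)), key=log_likelihood)
print(f"inferred healthiness weight: {w_hat:.2f}")
```

The human never stated "I value health"; the weight is recovered from behavior alone. The challenges below show where this breaks: the same choices could reflect bias or habit rather than a stable preference.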
Challenges:
- Humans are often irrational — which behaviors reflect true preferences vs. biases?
- Hierarchical preferences: most actions serve proximate goals, not terminal values
- Multi-principal: whose preferences count? How to aggregate?
Remaining Challenges Russell Acknowledges
- Gricean semantics: Humans communicate implicitly; the system must interpret what wasn't explicitly said
- Preference dynamics: Preferences change over time. Which self's preferences matter, the experiencing self or the remembering self?
- Multiperson coordination: Individual AI agents optimizing for separate humans create conflicts
- Wrong priors: If the robot develops incorrect beliefs about human preferences, shutdown deference disappears (Ryan Carey's incorrigibility result)
Significance for Teleo KB
Russell occupies a unique position in the alignment landscape: a mainstream AI researcher (not from the MIRI/EA ecosystem) who takes existential risk seriously but offers formal, game-theoretic solutions rather than pessimistic forecasts. His corrigibility-through-uncertainty directly challenges Yudkowsky's "corrigibility is hard" claim — Russell doesn't deny the difficulty but shows a formal mechanism that achieves it under certain conditions. The assistance games framework is also structurally compatible with our collective architecture: the agent as servant, not sovereign.