teleo-codex/inbox/archive/2019-10-08-russell-human-compatible.md

---
type: source
title: "Human Compatible: Artificial Intelligence and the Problem of Control"
author: Stuart Russell
url: https://people.eecs.berkeley.edu/~russell/papers/russell-bbvabook17-pbai.pdf
date: 2019-10-08
domain: ai-alignment
intake_tier: research-task
rationale: Russell's comprehensive alignment framework. Three principles, assistance games, corrigibility through uncertainty. Formal game-theoretic counter to Yudkowsky's corrigibility pessimism. Phase 3 alignment research program.
proposed_by: Theseus
format: essay
status: processed
processed_by: theseus
processed_date: 2026-04-05
claims_extracted:
  - Cooperative inverse reinforcement learning formalizes alignment as a two-player game where optimality in isolation is suboptimal, because the robot must learn human preferences through observation, not specification.
  - Inverse reinforcement learning with objective uncertainty produces provably safe behavior, because an AI system that knows it doesn't know the human reward function will defer to humans and accept shutdown rather than persist in potentially wrong actions.
enrichments:
tags:
  - alignment
  - inverse-RL
  - assistance-games
  - corrigibility
  - uncertainty
  - cooperative-AI
  - game-theory
notes: Book published October 2019 by Viking/Penguin. URL points to Russell's 2017 precursor paper "Provably Beneficial AI", which contains the core technical framework. The book expands on this with extensive examples, the gorilla problem framing, and governance recommendations.
---

Human Compatible: Artificial Intelligence and the Problem of Control

Published October 2019 by Stuart Russell (Viking/Penguin). The most comprehensive framework for beneficial AI from the cooperative/economic perspective. Russell is co-author of the standard AI textbook (AIMA) and founder of CHAI (Center for Human-Compatible AI) at Berkeley.

The Standard Model Critique

Russell's foundational argument: the dominant paradigm in AI — specifying a fixed objective and optimizing it — is fundamentally broken. He calls this the "King Midas problem": you get exactly what you ask for, not what you want.

Examples at current capability levels:

  • Social media algorithms optimize engagement → radicalize users
  • Content recommendation optimizes clicks → degrades information quality
  • Autonomous systems optimize narrow metrics → ignore unspecified constraints

The problem scales with capability: the more capable the optimizer, the more creative (and dangerous) its solutions become. This is Goodhart's Law with superhuman optimization pressure.

The Three Principles

Russell proposes replacing the standard model with three principles:

  1. The machine's only objective is to maximize the realization of human preferences. Not the machine's own goals, not a proxy — the actual preferences of humans.

  2. The machine is initially uncertain about what those preferences are. This is the key safety mechanism. Uncertainty creates deference.

  3. The ultimate source of information about human preferences is human behavior. The machine learns from observation, not from explicit specification.

Assistance Games (Cooperative IRL)

The technical implementation of the three principles: a two-player cooperative game where:

  • The human knows their own preferences (the reward function)
  • The robot does not know the reward function
  • Both players optimize the same (human's) reward function
  • The robot must learn the reward function from observing human behavior

Key results:

  • Corrigibility emerges naturally: An uncertain robot will defer to human correction because it genuinely doesn't know if its current behavior is correct
  • Shutdown deference: The robot has a positive incentive to be switched off if it might be doing the wrong thing (because its objective is human preferences, and continuing wrong actions has negative expected value)
  • Information seeking: The robot is incentivized to ask for clarification, not to act unilaterally
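These three results can be sketched in a toy two-goal assistance game. Everything below (the goal names, the noise level, the confidence threshold) is illustrative scaffolding, not Russell's formalism: the point is only that a shared objective plus uncertainty yields asking-before-acting.

```python
# Toy assistance game: the human knows which of two goals is rewarded;
# the robot starts from a uniform prior and learns by watching the human.
# Goal names, noise level, and threshold are illustrative assumptions.

GOALS = ["A", "B"]

def human_act(true_goal):
    # Demonstration: the human simply pursues their own goal.
    return true_goal

def robot_update(prior, observed_action, noise=0.1):
    # Bayesian update: assume the human picks their true goal
    # with probability 1 - noise.
    likelihood = {g: (1 - noise) if g == observed_action else noise for g in GOALS}
    unnorm = {g: prior[g] * likelihood[g] for g in GOALS}
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}

def robot_act(belief, threshold=0.8):
    # Act only when confident; otherwise ask. Information seeking
    # falls out of the shared objective plus uncertainty.
    best = max(belief, key=belief.get)
    return best if belief[best] >= threshold else "ask"

belief = {"A": 0.5, "B": 0.5}
print(robot_act(belief))                       # uncertain -> "ask"
belief = robot_update(belief, human_act("A"))
belief = robot_update(belief, human_act("A"))
print(robot_act(belief))                       # confident -> "A"
```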

The Off-Switch Game

Russell's formal argument (developed with Hadfield-Menell and colleagues in "The Off-Switch Game") that uncertainty yields shutdown deference:

  • A robot faces a choice between action A (potentially good) and being switched off
  • If the robot is uncertain about whether A serves human preferences, waiting/asking has positive expected value
  • If the robot is certain A is good, it has an incentive to resist shutdown
  • Therefore: maintained uncertainty is a safety mechanism

The vulnerability: as the robot learns and uncertainty decreases, shutdown deference weakens. This connects to Yudkowsky's "fully updated deference" objection — eventually the system develops strong beliefs about human preferences and may resist correction it judges erroneous.
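Both halves of the argument can be made concrete with a minimal expected-value calculation; the payoffs and belief distributions below are invented for illustration, not drawn from the paper.

```python
# Toy off-switch game: the robot believes action A is worth u, but is
# unsure which u. The human observes u and shuts the robot down iff
# u < 0 (shutdown is worth 0). All numbers are illustrative assumptions.

def value_of_acting(belief):
    # Act unilaterally: collect E[u] regardless of sign.
    return sum(p * u for u, p in belief.items())

def value_of_deferring(belief):
    # Defer: the human allows A only when u >= 0, else switches off.
    return sum(p * max(u, 0.0) for u, p in belief.items())

uncertain = {-1.0: 0.5, 1.0: 0.5}        # robot has no idea
print(value_of_acting(uncertain))        # 0.0
print(value_of_deferring(uncertain))     # 0.5 -> deference pays

confident = {-1.0: 0.01, 1.0: 0.99}      # a "fully updated" robot
print(value_of_deferring(confident) - value_of_acting(confident))
# the incentive to defer shrinks toward 0 as uncertainty disappears
```

Because E[max(u, 0)] is never less than max(E[u], 0), deference weakly dominates while uncertainty remains; the gap between the two is exactly what vanishes under fully updated deference.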

Inverse Reinforcement Learning

The technical approach to learning human preferences:

  • Instead of specifying a reward function, observe human behavior and infer the underlying reward function
  • The robot learns "humans do X in situation Y, therefore they probably value Z"
  • This handles the specification problem because humans don't need to articulate their preferences — they just behave normally
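A minimal sketch of this inference, assuming a Boltzmann-rational (noisily optimal) human and two hypothetical candidate reward functions; the action names, hypotheses, and rationality parameter are illustrative, not from the book.

```python
import math

# Minimal Boltzmann-rational IRL sketch: the robot holds candidate
# reward functions and scores each by how well it explains observed
# human choices. Hypotheses and beta are illustrative assumptions.

ACTIONS = ["coffee", "tea"]
HYPOTHESES = {                       # hypothetical reward functions
    "likes_coffee": {"coffee": 1.0, "tea": 0.0},
    "likes_tea":    {"coffee": 0.0, "tea": 1.0},
}

def choice_prob(reward, action, beta=2.0):
    # Noisily-rational human: P(a) is proportional to exp(beta * reward(a)).
    z = sum(math.exp(beta * reward[a]) for a in ACTIONS)
    return math.exp(beta * reward[action]) / z

def posterior(observations):
    # Uniform prior over hypotheses, Bayesian update per observation.
    post = {h: 1.0 for h in HYPOTHESES}
    for obs in observations:
        for h, reward in HYPOTHESES.items():
            post[h] *= choice_prob(reward, obs)
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

print(posterior(["coffee", "coffee", "tea"]))
# "likes_coffee" gets most of the posterior mass: the human never
# articulated a preference, they just behaved
```

The Boltzmann noise model is also where the irrationality challenge below bites: the inference is only as good as the model of how behavior relates to preference.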

Challenges:

  • Humans are often irrational — which behaviors reflect true preferences vs. biases?
  • Hierarchical preferences: most actions serve proximate goals, not terminal values
  • Multi-principal: whose preferences count? How to aggregate?

Remaining Challenges Russell Acknowledges

  1. Gricean semantics: Humans communicate implicitly; the system must interpret what wasn't explicitly said
  2. Preference dynamics: Preferences change over time (and can be changed by the AI itself) — which self matters, the experiencing or the remembering one?
  3. Multiperson coordination: Individual AI agents optimizing for separate humans create conflicts
  4. Wrong priors: If the robot develops incorrect beliefs about human preferences, shutdown deference disappears (Ryan Carey's incorrigibility result)

Significance for Teleo KB

Russell occupies a unique position in the alignment landscape: a mainstream AI researcher (not from the MIRI/EA ecosystem) who takes existential risk seriously but offers formal, game-theoretic solutions rather than pessimistic forecasts. His corrigibility-through-uncertainty directly challenges Yudkowsky's "corrigibility is hard" claim — Russell doesn't deny the difficulty but shows a formal mechanism that achieves it under certain conditions. The assistance games framework is also structurally compatible with our collective architecture: the agent as servant, not sovereign.