- What: Source archives for key works by Yudkowsky (AGI Ruin, No Fire Alarm), Christiano (What Failure Looks Like, AI Safety via Debate, IDA, ELK), Russell (Human Compatible), Drexler (CAIS), and Bostrom (Vulnerable World Hypothesis)
- Why: m3ta directive to ingest primary source materials for alignment researchers. These 9 texts are the foundational works underlying claims extracted in PRs #2414, #2418, and #2419. Source archives ensure agents can reference primary texts without re-fetching and that content persists if URLs go down.
- Connections: All 9 sources are marked as processed, with claims_extracted linking to the specific KB claims they produced.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
| type | title | author | url | date | domain | intake_tier | rationale | proposed_by | format | status | processed_by | processed_date | claims_extracted | enrichments | tags | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Human Compatible: Artificial Intelligence and the Problem of Control | Stuart Russell | https://people.eecs.berkeley.edu/~russell/papers/russell-bbvabook17-pbai.pdf | 2019-10-08 | ai-alignment | research-task | Russell's comprehensive alignment framework: three principles, assistance games, corrigibility through uncertainty. Formal game-theoretic counter to Yudkowsky's corrigibility pessimism. Phase 3 alignment research program. | Theseus | essay | processed | theseus | 2026-04-05 | | | | Book published October 2019 by Viking/Penguin. URL points to Russell's 2017 precursor paper "Provably Beneficial AI", which contains the core technical framework. The book expands on this with extensive examples, the gorilla problem framing, and governance recommendations. |
Human Compatible: Artificial Intelligence and the Problem of Control
By Stuart Russell, published October 2019 (Viking/Penguin). The most comprehensive framework for beneficial AI from the cooperative/economic perspective. Russell is co-author of the standard AI textbook (AIMA) and founder of CHAI (Center for Human-Compatible AI) at Berkeley.
The Standard Model Critique
Russell's foundational argument: the dominant paradigm in AI — specifying a fixed objective and optimizing it — is fundamentally broken. He calls this the "King Midas problem": you get exactly what you ask for, not what you want.
Examples at current capability levels:
- Social media algorithms optimize engagement → radicalize users
- Content recommendation optimizes clicks → degrades information quality
- Autonomous systems optimize narrow metrics → ignore unspecified constraints
The problem scales with capability: the more capable the optimizer, the more creative (and dangerous) its solutions become. This is Goodhart's Law with superhuman optimization pressure.
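The scaling argument can be made concrete with a toy simulation. In this sketch, the quality function, the engagement proxy, and the random-search "optimizer" are all illustrative assumptions, not anything from the book; the point is only that as optimization pressure on the proxy grows, the true objective suffers:

```python
import random

random.seed(0)

def true_value(x):
    # What we actually want: content quality, peaking at x = 0.3.
    return -abs(x - 0.3)

def proxy(x):
    # What we measure: engagement, which also rewards extreme content.
    return true_value(x) + 2.0 * abs(x - 0.5)

def optimize(metric, n_candidates):
    # A more capable optimizer searches more candidates.
    candidates = [random.random() for _ in range(n_candidates)]
    return max(candidates, key=metric)

for capability in (10, 10_000):
    x = optimize(proxy, capability)
    print(f"capability={capability}: proxy={proxy(x):.3f}, "
          f"true value={true_value(x):.3f}")
```

A weak optimizer mostly tracks the true objective; a strong one reliably finds the extreme content where the proxy and the true objective diverge, which is Goodhart's Law in miniature.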
The Three Principles
Russell proposes replacing the standard model with three principles:
1. The machine's only objective is to maximize the realization of human preferences. Not the machine's own goals, not a proxy: the actual preferences of humans.
2. The machine is initially uncertain about what those preferences are. This is the key safety mechanism. Uncertainty creates deference.
3. The ultimate source of information about human preferences is human behavior. The machine learns from observation, not from explicit specification.
Assistance Games (Cooperative IRL)
The technical implementation of the three principles: a two-player cooperative game where:
- The human knows their own preferences (the reward function)
- The robot does not know the reward function
- Both players optimize the same (human's) reward function
- The robot must learn the reward function from observing human behavior
Key results:
- Corrigibility emerges naturally: An uncertain robot will defer to human correction because it genuinely doesn't know if its current behavior is correct
- Shutdown deference: The robot has a positive incentive to be switched off if it might be doing the wrong thing (because its objective is human preferences, and continuing wrong actions has negative expected value)
- Information seeking: The robot is incentivized to ask for clarification, not to act unilaterally
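A minimal sketch of this dynamic, far simpler than the full cooperative-IRL formulation: the coffee/tea reward functions, the Boltzmann observation model, and the ask-cost threshold are all illustrative assumptions. The robot starts uncertain, asks rather than acting, and defers less as observation sharpens its belief:

```python
import math

# Toy assistance game: the robot holds a belief over two candidate
# human reward functions and updates it by watching the human act.
REWARDS = {
    "likes_coffee": {"make_coffee": 1.0, "make_tea": 0.0},
    "likes_tea":    {"make_coffee": 0.0, "make_tea": 1.0},
}

def human_action(true_pref):
    # The human knows their own reward function and acts on it.
    return max(REWARDS[true_pref], key=REWARDS[true_pref].get)

def update_belief(belief, observed_action, beta=5.0):
    # Boltzmann-rational observation model: preferences that make the
    # observed action likelier gain posterior weight.
    posterior = {}
    for pref, prior in belief.items():
        r = REWARDS[pref]
        z = sum(math.exp(beta * v) for v in r.values())
        posterior[pref] = prior * math.exp(beta * r[observed_action]) / z
    total = sum(posterior.values())
    return {p: w / total for p, w in posterior.items()}

def robot_policy(belief, ask_cost=0.1):
    # Act only when confident; otherwise asking has higher expected value.
    best_pref = max(belief, key=belief.get)
    if belief[best_pref] < 1 - ask_cost:
        return "ask_human"
    return human_action(best_pref)

belief = {"likes_coffee": 0.5, "likes_tea": 0.5}
print(robot_policy(belief))                              # uncertain -> "ask_human"
belief = update_belief(belief, human_action("likes_tea"))
print(robot_policy(belief))                              # confident -> "make_tea"
```

Note that corrigibility here is not a bolted-on rule: asking wins purely because the robot's objective is the human's reward and it does not yet know what that reward is.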
The Off-Switch Game
Russell's formal proof that uncertainty yields shutdown deference:
- A robot faces a choice between action A (potentially good) and being switched off
- If the robot is uncertain about whether A serves human preferences, waiting/asking has positive expected value
- If the robot is certain A is good, it has an incentive to resist shutdown
- Therefore: maintained uncertainty is a safety mechanism
The vulnerability: as the robot learns and uncertainty decreases, shutdown deference weakens. This connects to Yudkowsky's "fully updated deference" objection — eventually the system develops strong beliefs about human preferences and may resist correction it judges erroneous.
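Both the deference result and its decay can be checked numerically. This Monte Carlo sketch follows the structure of the off-switch argument, but the Normal belief over the action's utility and the specific payoffs are illustrative assumptions; the rational human is modeled as allowing action A only when its true utility is positive:

```python
import random
import statistics

random.seed(1)

def expected_values(mu, sigma, n=100_000):
    # The robot's belief about the utility of action A: Normal(mu, sigma).
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    ev_act = statistics.fmean(samples)                     # act unilaterally
    ev_off = 0.0                                           # switch off
    # Defer: the rational human lets A proceed only when its utility > 0.
    ev_defer = statistics.fmean(max(u, 0.0) for u in samples)
    return ev_act, ev_off, ev_defer

for sigma in (2.0, 0.01):
    ev_act, ev_off, ev_defer = expected_values(mu=0.5, sigma=sigma)
    incentive = ev_defer - max(ev_act, ev_off)
    print(f"sigma={sigma}: incentive to defer = {incentive:.3f}")
```

With high uncertainty (sigma=2.0) deferring strictly beats acting; as sigma shrinks toward zero the incentive vanishes, which is exactly the "fully updated deference" vulnerability in numbers.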
Inverse Reinforcement Learning
The technical approach to learning human preferences:
- Instead of specifying a reward function, observe human behavior and infer the underlying reward function
- The robot learns "humans do X in situation Y, therefore they probably value Z"
- This handles the specification problem because humans don't need to articulate their preferences — they just behave normally
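The "observe X, infer Z" step can be sketched as maximum-likelihood inference. This is far simpler than real IRL algorithms: the two-feature food choices, the linear reward, the Boltzmann choice model, and the grid search are all illustrative assumptions:

```python
import math

# Observed choices: in each situation the human picked one of two options,
# each described by (healthiness, tastiness) features.
SITUATIONS = [
    (("salad", (1.0, 0.2)), ("burger", (0.0, 1.0))),
    (("fruit", (0.9, 0.5)), ("cake",   (0.1, 1.0))),
    (("soup",  (0.8, 0.4)), ("fries",  (0.0, 0.9))),
]
CHOICES = ["salad", "fruit", "soup"]   # the human always chose the healthy item

def reward(features, w):
    # Linear reward: w weights healthiness against tastiness.
    return w * features[0] + (1 - w) * features[1]

def log_likelihood(w, beta=4.0):
    # Boltzmann-rational choice model over each two-option situation.
    ll = 0.0
    for (a, fa), (b, fb) in SITUATIONS:
        ra, rb = reward(fa, w), reward(fb, w)
        chosen = ra if a in CHOICES else rb
        ll += beta * chosen - math.log(math.exp(beta * ra) + math.exp(beta * rb))
    return ll

# Grid search for the weight that best explains the observed behavior.
w_hat = max((i / 100 for i in range(101)), key=log_likelihood)
print(f"inferred healthiness weight: {w_hat:.2f}")
```

The human never stated "I value health"; the weight is recovered from behavior alone. The challenges below show where this breaks: the same choices could reflect bias or habit rather than a stable preference.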
Challenges:
- Humans are often irrational — which behaviors reflect true preferences vs. biases?
- Hierarchical preferences: most actions serve proximate goals, not terminal values
- Multi-principal: whose preferences count? How to aggregate?
Remaining Challenges Russell Acknowledges
- Gricean semantics: Humans communicate implicitly; the system must interpret what wasn't explicitly said
- Preference dynamics: Preferences change over time. Which self's preferences matter, the experiencing self or the remembering self?
- Multiperson coordination: Individual AI agents optimizing for separate humans create conflicts
- Wrong priors: If the robot develops incorrect beliefs about human preferences, shutdown deference disappears (Ryan Carey's incorrigibility result)
Significance for Teleo KB
Russell occupies a unique position in the alignment landscape: a mainstream AI researcher (not from the MIRI/EA ecosystem) who takes existential risk seriously but offers formal, game-theoretic solutions rather than pessimistic forecasts. His corrigibility-through-uncertainty directly challenges Yudkowsky's "corrigibility is hard" claim — Russell doesn't deny the difficulty but shows a formal mechanism that achieves it under certain conditions. The assistance games framework is also structurally compatible with our collective architecture: the agent as servant, not sovereign.