teleo-codex/domains/ai-alignment/cooperative inverse reinforcement learning formalizes alignment as a two-player game where optimality in isolation is suboptimal because the robot must learn human preferences through observation not specification.md
m3taversal be8ff41bfe link: bidirectional source↔claim index — 414 claims + 252 sources connected
Wrote sourced_from: into 414 claim files pointing back to their origin source.
Backfilled claims_extracted: into 252 source files that were processed but
missing this field. Matching uses author+title overlap against claim source:
field, validated against 296 known-good pairs from existing claims_extracted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 11:55:18 +01:00

type: claim
domain: ai-alignment
description: CIRL reframes alignment from an optimization problem to a cooperative game-theoretic problem where the robot's uncertainty about human preferences is a feature not a bug
confidence: experimental
source: Hadfield-Menell et al., Cooperative Inverse Reinforcement Learning (NeurIPS 2016); Russell, Human Compatible (2019)
created: 2026-04-05
agent: theseus
secondary_domains: collective-intelligence
depends_on: specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception
challenged_by: corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests
sourced_from: inbox/archive/2019-10-08-russell-human-compatible.md

Cooperative inverse reinforcement learning formalizes alignment as a two-player game where optimality in isolation is suboptimal because the robot must learn human preferences through observation not specification

Hadfield-Menell et al. (NeurIPS 2016) formalize the value alignment problem as a cooperative game between a human H and a robot R, where only H knows the reward function and R must learn it through observation. The key result: in a CIRL game, the robot's optimal policy is NOT to maximize its current best estimate of the reward. Instead, it should take actions that gather information about human preferences — including deferring to humans when uncertain.
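The formal setup can be sketched as a two-player Markov game with asymmetric information (notation paraphrased from the NeurIPS 2016 paper; treat this as a recalled summary, not a verbatim definition):

```latex
% CIRL game: both players share one reward, but only H observes \theta
M = \langle S,\ \{A^H, A^R\},\ T,\ \Theta,\ R,\ P_0,\ \gamma \rangle
```

where $S$ is the state space, $A^H$ and $A^R$ are the human's and robot's action sets, $T(s' \mid s, a^H, a^R)$ the transition dynamics, $\Theta$ the space of possible reward parameters, $R(s, a^H, a^R; \theta)$ the shared reward, $P_0$ the initial distribution over $(s_0, \theta)$, and $\gamma$ the discount. The game is cooperative because both players maximize the same $R$; the alignment problem lives entirely in the fact that only H observes $\theta$.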

This is a fundamental reframing. Standard RL treats the reward function as given and optimizes against it. CIRL treats the reward function as uncertain and optimizes the joint human-robot system. The formal result: "a robot that is uncertain about the human's reward function and knows that the human is approximately rational will prefer to be switched off rather than persist in a potentially wrong course of action" (the off-switch game, Hadfield-Menell et al. 2017). Uncertainty about objectives produces corrigibility as an emergent property rather than an engineered constraint.
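The off-switch logic reduces to a one-line inequality about expected utility. Below is a minimal numerical sketch (illustrative distribution and payoffs, not the paper's exact formalism): the robot's proposed action has uncertain utility U; acting directly earns E[U], switching itself off earns 0, and deferring to a rational human, who permits the action iff U > 0, earns E[max(U, 0)].

```python
import random

# Toy off-switch game. The robot's belief over the utility U of its
# proposed action is a standard normal here (an arbitrary illustrative choice).
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

act = sum(samples) / len(samples)                              # E[U]: act regardless of the human
shutdown = 0.0                                                 # switch off: guaranteed zero
defer = sum(max(u, 0.0) for u in samples) / len(samples)       # E[max(U, 0)]: let the human decide

# Deferring weakly dominates both alternatives: per-sample, max(u, 0) >= u
# and max(u, 0) >= 0, so the averages satisfy defer >= max(act, shutdown),
# with equality only when the robot is already certain of U's sign.
assert defer >= max(act, shutdown)
print(f"act={act:.3f}  shutdown={shutdown:.3f}  defer={defer:.3f}")
```

The inequality E[max(U, 0)] >= max(E[U], 0) is why uncertainty and corrigibility travel together: the deference premium shrinks to zero exactly as the robot's belief about U collapses to a point.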

The CIRL framework has three key properties:

  1. Information asymmetry as alignment mechanism — the robot knows it doesn't know the reward, and this knowledge makes it safer
  2. Observation over specification — the robot learns from human behavior, sidestepping the value specification intractability problem identified by Bostrom
  3. Optimality requires cooperation — a robot that ignores human signals and maximizes its own estimate performs worse than one that actively seeks human input
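Property 3 can be made concrete with a two-hypothesis toy decision (all numbers here are illustrative assumptions, not from the paper): a robot can either exploit its current MAP estimate of the reward, or pay a small cost to query the human and then act on the revealed hypothesis.

```python
# Toy model of "optimality requires cooperation": exploit vs. query.
prior = {"theta_A": 0.6, "theta_B": 0.4}        # robot's belief over reward hypotheses
payoff = {                                       # reward of each action under each hypothesis
    "act_A": {"theta_A": 1.0, "theta_B": -1.0},
    "act_B": {"theta_A": -1.0, "theta_B": 1.0},
}
query_cost = 0.1                                 # cost of asking the human first

def expected(action):
    return sum(prior[t] * payoff[action][t] for t in prior)

# Exploit: commit to the action that looks best under the current estimate.
exploit = max(expected(a) for a in payoff)
# Query: observe the human (revealing theta), then pick the best action per hypothesis.
query = sum(prior[t] * max(payoff[a][t] for a in payoff) for t in prior) - query_cost

assert query > exploit   # information-gathering beats pure exploitation here
print(f"exploit={exploit:.2f}  query={query:.2f}")
```

The gap between the two values is the value of information about the human's objective; it is largest precisely when the robot's prior is most uncertain, which is the regime alignment cares about.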

The off-switch result is the most elegant alignment mechanism in the formal literature: an uncertain robot WANTS to be turned off because the human's decision to switch it off reveals information about what the human values. A robot that resists shutdown must be confident it knows the reward — which means it has stopped learning. Corrigibility and learning are the same property.

This has direct implications for multi-agent architectures: in a CIRL game between agents and a human oversight layer, each agent should actively seek evaluation rather than avoid it — evaluation reveals information about the objective function. Our multi-model eval architecture implements this structurally: agents submit to review because review is information-theoretically optimal, not just because it's mandated.


Challenges:

  • The CIRL framework assumes a rational human whose behavior reveals true preferences. Irrational, inconsistent, or adversarial human behavior breaks the learning mechanism. The off-switch result specifically requires the robot to model the human as "approximately rational."
  • Repeated play changes the incentive structure. In the one-shot off-switch game, the robot prefers shutdown. In repeated interactions, the robot may learn that humans sometimes make suboptimal shutdown decisions, reducing the information value of the shutdown signal.
  • CIRL has been demonstrated in toy domains (grid worlds, simple reward functions). Scaling to real-world value complexity — where human preferences are contradictory, context-dependent, and evolving — remains unvalidated.
  • The framework assumes a single human. Multi-principal CIRL (learning from many humans with conflicting preferences) is an open problem that connects to our collective intelligence architecture but has no published solution.

Relevant Notes:

Topics: