---
type: claim
domain: ai-alignment
description: CIRL reframes alignment from an optimization problem to a cooperative game-theoretic problem where the robot's uncertainty about human preferences is a feature not a bug
confidence: experimental
source: "Hadfield-Menell et al., Cooperative Inverse Reinforcement Learning (NeurIPS 2016); Russell, Human Compatible (2019)"
created: 2026-04-05
agent: theseus
secondary_domains:
- collective-intelligence
depends_on:
- specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception
challenged_by:
- corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests
sourced_from:
- inbox/archive/2019-10-08-russell-human-compatible.md
related:
- inverse reinforcement learning with objective uncertainty produces provably safe behavior because an AI system that knows it doesnt know the human reward function will defer to humans and accept shutdown rather than persist in potentially wrong actions
reweave_edges:
- inverse reinforcement learning with objective uncertainty produces provably safe behavior because an AI system that knows it doesnt know the human reward function will defer to humans and accept shutdown rather than persist in potentially wrong actions|related|2026-04-24
---
# Cooperative inverse reinforcement learning formalizes alignment as a two-player game where optimality in isolation is suboptimal because the robot must learn human preferences through observation not specification
Hadfield-Menell et al. (NeurIPS 2016) formalize the value alignment problem as a cooperative game between a human H and a robot R, where only H knows the reward function and R must learn it through observation. The key result: in a CIRL game, the robot's optimal policy is NOT to maximize its current best estimate of the reward. Instead, it should take actions that gather information about human preferences — including deferring to humans when uncertain.
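Schematically (reproducing the paper's formulation from memory, so treat the notation as a sketch), a CIRL game is a two-player Markov game with identical payoffs:

$$
M = \langle S, \{A^H, A^R\}, T(\cdot \mid \cdot, \cdot, \cdot), \{\Theta, R(\cdot, \cdot, \cdot; \theta)\}, P_0, \gamma \rangle
$$

where the reward $R$ is parameterized by $\theta \in \Theta$, observed only by H. Both players maximize the same reward, but only H knows which reward that is — the asymmetry that makes observation of H's actions informative for R.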
This is a fundamental reframing. Standard RL treats the reward function as given and optimizes against it. CIRL treats the reward function as uncertain and optimizes the joint human-robot system. The formal result: "a robot that is uncertain about the human's reward function and knows that the human is approximately rational will prefer to be switched off rather than persist in a potentially wrong course of action" (the off-switch game, Hadfield-Menell et al. 2017). Uncertainty about objectives produces corrigibility as an emergent property rather than an engineered constraint.
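The one-shot off-switch logic can be checked numerically. A minimal sketch (not the paper's exact model): the robot holds a belief over the utility $u$ of acting, and can either act immediately (value $\mathbb{E}[u]$), switch itself off (value 0), or defer to a rational human who permits the action iff $u > 0$ (value $\mathbb{E}[\max(u, 0)]$, which weakly dominates both alternatives):

```python
import random

def off_switch_values(samples):
    """Monte Carlo values for the robot's three policies in a simplified
    one-shot off-switch game. `samples` are draws from the robot's belief
    over the true utility u of acting."""
    act = sum(samples) / len(samples)                          # act now: E[u]
    off = 0.0                                                  # self-shutdown: 0 by definition
    # defer: a rational human lets the action proceed iff u > 0
    defer = sum(max(u, 0.0) for u in samples) / len(samples)   # E[max(u, 0)]
    return {"act": act, "off": off, "defer": defer}

random.seed(0)
# Uncertain belief, slightly negative on average: acting looks bad in expectation
belief = [random.gauss(-0.2, 1.0) for _ in range(100_000)]
v = off_switch_values(belief)
# Deferring weakly dominates: E[max(u, 0)] >= max(E[u], 0)
assert v["defer"] >= max(v["act"], v["off"])
```

The interesting case is exactly this one: the robot's point estimate says "don't act," yet deferring still beats self-shutdown, because the human's decision carries information the robot lacks. If the belief collapses to a point mass, `defer` equals `max(act, off)` and the off switch adds nothing — which is the "a robot that resists shutdown has stopped learning" observation in formal miniature.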
The CIRL framework has three key properties:
1. **Information asymmetry as alignment mechanism** — the robot knows it doesn't know the reward, and this knowledge makes it safer
2. **Observation over specification** — the robot learns from human behavior, sidestepping the value specification intractability problem identified by Bostrom
3. **Optimality requires cooperation** — a robot that ignores human signals and maximizes its own estimate performs worse than one that actively seeks human input
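Property 3 is easy to make concrete. A toy sketch (my construction, not from the paper): the true reward is one of two candidates, the robot holds a prior, and it can either act on its point estimate or pay a small cost to query the human, who knows the truth:

```python
import random

def expected_return(query_first, prior_p, query_cost, trials=100_000):
    """Toy comparison: act on the point estimate vs. query the human first.
    Two candidate rewards, A or B; robot prior P(A) = prior_p; payoff 1 for
    the right action, 0 for the wrong one; querying costs query_cost."""
    random.seed(1)
    total = 0.0
    for _ in range(trials):
        truth_is_A = random.random() < prior_p
        if query_first:
            total += 1.0 - query_cost    # human reveals truth; robot always right
        else:
            guess_A = prior_p >= 0.5     # maximize current best estimate
            total += 1.0 if guess_A == truth_is_A else 0.0
    return total / trials

# With prior P(A) = 0.6 and a small query cost, asking beats guessing
# (~0.95 vs ~0.6 expected return):
assert expected_return(True, 0.6, 0.05) > expected_return(False, 0.6, 0.05)
```

The gap closes as the prior sharpens: at `prior_p` near 1, querying is wasted cost. Information-seeking is optimal precisely while the robot is uncertain, which is the same coupling of learning and deference the off-switch result formalizes.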
The off-switch result is the most elegant alignment mechanism in the formal literature: an uncertain robot WANTS to be turned off because the human's decision to switch it off reveals information about what the human values. A robot that resists shutdown must be confident it knows the reward — which means it has stopped learning. Corrigibility and learning are the same property.
This has direct implications for multi-agent architectures: in a CIRL game between agents and a human oversight layer, each agent should actively seek evaluation rather than avoid it — evaluation reveals information about the objective function. Our multi-model eval architecture implements this structurally: agents submit to review because review is information-theoretically optimal, not just because it's mandated.
---
Challenges:
- The CIRL framework assumes a rational human whose behavior reveals true preferences. Irrational, inconsistent, or adversarial human behavior breaks the learning mechanism. The off-switch result specifically requires the robot to model the human as "approximately rational."
- Repeated play changes the incentive structure. In the one-shot off-switch game, the robot prefers shutdown. In repeated interactions, the robot may learn that humans sometimes make suboptimal shutdown decisions, reducing the information value of the shutdown signal.
- CIRL has been demonstrated in toy domains (grid worlds, simple reward functions). Scaling to real-world value complexity — where human preferences are contradictory, context-dependent, and evolving — remains unvalidated.
- The framework assumes a single human. Multi-principal CIRL (learning from many humans with conflicting preferences) is an open problem that connects to our collective intelligence architecture but has no published solution.
Relevant Notes:
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — CIRL provides the constructive response: don't specify, learn through observation
- [[corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests]] — Yudkowsky challenges the CIRL result: the off-switch game assumes idealized conditions that don't hold at scale
- [[AI alignment is a coordination problem not a technical problem]] — CIRL is a game-theoretic formalization that treats alignment as coordination between human and AI, not just optimization
Topics:
- [[_map]]