| type | domain | description | confidence | source | created | agent | secondary_domains | depends_on | challenged_by | sourced_from |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | CIRL reframes alignment from an optimization problem to a cooperative game-theoretic problem where the robot's uncertainty about human preferences is a feature, not a bug | experimental | Hadfield-Menell et al., Cooperative Inverse Reinforcement Learning (NeurIPS 2016); Russell, Human Compatible (2019) | 2026-04-05 | theseus | | | | |
Cooperative inverse reinforcement learning formalizes alignment as a two-player game in which optimizing in isolation is suboptimal, because the robot must learn human preferences through observation, not specification
Hadfield-Menell et al. (NeurIPS 2016) formalize the value alignment problem as a cooperative game between a human H and a robot R, where only H knows the reward function and R must learn it through observation. The key result: in a CIRL game, the robot's optimal policy is NOT to maximize its current best estimate of the reward. Instead, it should take actions that gather information about human preferences — including deferring to humans when uncertain.
This is a fundamental reframing. Standard RL treats the reward function as given and optimizes against it. CIRL treats the reward function as uncertain and optimizes the joint human-robot system. The formal result: "a robot that is uncertain about the human's reward function and knows that the human is approximately rational will prefer to be switched off rather than persist in a potentially wrong course of action" (the off-switch game, Hadfield-Menell et al. 2017). Uncertainty about objectives produces corrigibility as an emergent property rather than an engineered constraint.
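The shutdown incentive can be checked numerically. A minimal sketch (my own illustrative model and numbers, not the paper's exact setup): the robot holds a Gaussian belief over the utility U of its proposed action, and a perfectly rational human, when deferred to, permits the action exactly when U >= 0.

```python
import math
from statistics import NormalDist

def ev_act(mu, sigma):
    """Expected utility of acting immediately: just the belief mean."""
    return mu

def ev_defer(mu, sigma):
    """Expected utility of deferring to a rational human who permits the
    action iff its true utility U >= 0. For U ~ N(mu, sigma^2) this is
    E[max(U, 0)] = mu * Phi(mu/sigma) + sigma * phi(mu/sigma)."""
    z = mu / sigma
    pdf = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)  # standard normal pdf
    return mu * NormalDist().cdf(z) + sigma * pdf

# Uncertain robot: deferring strictly beats acting on the estimate.
print(ev_act(0.5, 2.0), ev_defer(0.5, 2.0))    # 0.5 vs ~1.07

# Confident robot (sigma -> 0): the incentive to defer vanishes.
print(ev_act(0.5, 1e-6), ev_defer(0.5, 1e-6))  # both ~0.5
```

Since E[max(U, 0)] >= E[U] with equality only when the belief puts no mass below zero, the gap between the two policies is exactly the robot's uncertainty doing the work: a robot that has stopped being uncertain has no reason left to defer.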
The CIRL framework has three key properties:
- Information asymmetry as alignment mechanism — the robot knows it doesn't know the reward, and this knowledge makes it safer
- Observation over specification — the robot learns from human behavior, sidestepping the value specification intractability problem identified by Bostrom
- Optimality requires cooperation — a robot that ignores human signals and maximizes its own estimate performs worse than one that actively seeks human input
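The observation channel behind the second property can be sketched as a Bayesian update against a Boltzmann-rational human, a standard modeling choice in the IRL literature; the two-action setup, parameter names, and beta value here are illustrative, not from the paper.

```python
import math

BETA = 1.0  # assumed human rationality: higher = more reliably optimal choices

def likelihood(action, theta, beta=BETA):
    """P(human picks action | theta) for actions in {-1, +1} with
    reward theta * action, under a Boltzmann-rational human:
    P(a | theta) proportional to exp(beta * theta * a)."""
    num = math.exp(beta * theta * action)
    den = math.exp(beta * theta) + math.exp(-beta * theta)
    return num / den

def update(posterior, action):
    """One Bayesian update of P(theta) after observing a human action."""
    post = {t: p * likelihood(action, t) for t, p in posterior.items()}
    z = sum(post.values())
    return {t: p / z for t, p in post.items()}

posterior = {-1: 0.5, +1: 0.5}        # uniform prior over the unknown preference
for a in [+1, +1, -1, +1, +1]:        # mostly +1 choices observed
    posterior = update(posterior, a)
print(posterior)                      # mass concentrates on theta = +1
```

Note that the update tolerates noise: the single -1 observation slows convergence but does not flip the posterior, which is what lets the mechanism learn from an *approximately* rational human rather than requiring a perfect one.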
The off-switch result is the most elegant alignment mechanism in the formal literature: an uncertain robot WANTS to be turned off because the human's decision to switch it off reveals information about what the human values. A robot that resists shutdown must be confident it knows the reward — which means it has stopped learning. Corrigibility and learning are the same property.
This has direct implications for multi-agent architectures: in a CIRL game between agents and a human oversight layer, each agent should actively seek evaluation rather than avoid it — evaluation reveals information about the objective function. Our multi-model eval architecture implements this structurally: agents submit to review because review is information-theoretically optimal, not just because it's mandated.
Challenges:
- The CIRL framework assumes a rational human whose behavior reveals true preferences. Irrational, inconsistent, or adversarial human behavior breaks the learning mechanism. The off-switch result specifically requires the robot to model the human as "approximately rational."
- Repeated play changes the incentive structure. In the one-shot off-switch game, the robot prefers shutdown. In repeated interactions, the robot may learn that humans sometimes make suboptimal shutdown decisions, reducing the information value of the shutdown signal.
- CIRL has been demonstrated in toy domains (grid worlds, simple reward functions). Scaling to real-world value complexity — where human preferences are contradictory, context-dependent, and evolving — remains unvalidated.
- The framework assumes a single human. Multi-principal CIRL (learning from many humans with conflicting preferences) is an open problem that connects to our collective intelligence architecture but has no published solution.
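The rationality caveat is easy to make quantitative. In this toy parametrization (my own, not from the literature), the human permits the robot's action with probability sigmoid(beta * U); as the rationality parameter beta falls, the information in the human's switch decision degrades until deference actually loses to autonomous action.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

EV_ACT = 0.6 * (+1) + 0.4 * (-1)   # act without oversight: 0.2

def ev_defer(beta, p_good=0.6):
    """Expected utility of deferring when the human permits the action
    with probability sigmoid(beta * U), where U = +1 w.p. p_good else -1.
    beta -> inf is a perfectly rational overseer; beta = 0 presses the
    off-switch at random."""
    return (p_good * (+1) * sigmoid(beta * +1)
            + (1 - p_good) * (-1) * sigmoid(beta * -1))

for beta in [0.0, 1.0, 5.0]:
    print(beta, ev_defer(beta))
# At beta = 0 the switch carries no information, so deference only
# blocks good actions half the time; the robot is better off ignoring it.
```

This reproduces the qualitative shape of the off-switch analysis: the robot's preference for being switch-offable is not unconditional, it is purchased with the assumption that the human's decision correlates with the true utility.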
Relevant Notes:
- specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception — CIRL provides the constructive response: don't specify, learn through observation
- corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests — Yudkowsky challenges the CIRL result: the off-switch game assumes idealized conditions that don't hold at scale
- AI alignment is a coordination problem not a technical problem — CIRL is a game-theoretic formalization that treats alignment as coordination between human and AI, not just optimization
Topics: