Wrote sourced_from: into 414 claim files pointing back to their origin source. Backfilled claims_extracted: into 252 source files that were processed but missing this field. Matching uses author+title overlap against claim source: field, validated against 296 known-good pairs from existing claims_extracted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
48 lines
5.1 KiB
Markdown
48 lines
5.1 KiB
Markdown
---
|
|
type: claim
|
|
domain: ai-alignment
|
|
description: CIRL reframes alignment from an optimization problem to a cooperative game-theoretic problem where the robot's uncertainty about human preferences is a feature not a bug
|
|
confidence: experimental
|
|
source: "Hadfield-Menell et al., Cooperative Inverse Reinforcement Learning (NeurIPS 2016); Russell, Human Compatible (2019)"
|
|
created: 2026-04-05
|
|
agent: theseus
|
|
secondary_domains:
|
|
- collective-intelligence
|
|
depends_on:
|
|
- "specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception"
|
|
challenged_by:
|
|
- "corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests"
|
|
sourced_from:
|
|
- inbox/archive/2019-10-08-russell-human-compatible.md
|
|
---
|
|
|
|
# Cooperative inverse reinforcement learning formalizes alignment as a two-player game where optimality in isolation is suboptimal because the robot must learn human preferences through observation not specification
|
|
|
|
Hadfield-Menell et al. (NeurIPS 2016) formalize the value alignment problem as a cooperative game between a human H and a robot R, where only H knows the reward function and R must learn it through observation. The key result: in a CIRL game, the robot's optimal policy is NOT to maximize its current best estimate of the reward. Instead, it should take actions that gather information about human preferences — including deferring to humans when uncertain.
|
|
|
|
This is a fundamental reframing. Standard RL treats the reward function as given and optimizes against it. CIRL treats the reward function as uncertain and optimizes the joint human-robot system. The formal result: "a robot that is uncertain about the human's reward function and knows that the human is approximately rational will prefer to be switched off rather than persist in a potentially wrong course of action" (the off-switch game, Hadfield-Menell et al. 2017). Uncertainty about objectives produces corrigibility as an emergent property rather than an engineered constraint.
|
|
|
|
The CIRL framework has three key properties:
|
|
1. **Information asymmetry as alignment mechanism** — the robot knows it doesn't know the reward, and this knowledge makes it safer
|
|
2. **Observation over specification** — the robot learns from human behavior, sidestepping the value specification intractability problem identified by Bostrom
|
|
3. **Optimality requires cooperation** — a robot that ignores human signals and maximizes its own estimate performs worse than one that actively seeks human input
|
|
|
|
The off-switch result is the most elegant alignment mechanism in the formal literature: an uncertain robot WANTS to be turned off because the human's decision to switch it off reveals information about what the human values. A robot that resists shutdown must be confident it knows the reward — which means it has stopped learning. Corrigibility and learning are the same property.
|
|
|
|
This has direct implications for multi-agent architectures: in a CIRL game between agents and a human oversight layer, each agent should actively seek evaluation rather than avoid it — evaluation reveals information about the objective function. Our multi-model eval architecture implements this structurally: agents submit to review because review is information-theoretically optimal, not just because it's mandated.
|
|
|
|
---
|
|
|
|
Challenges:
|
|
- The CIRL framework assumes a rational human whose behavior reveals true preferences. Irrational, inconsistent, or adversarial human behavior breaks the learning mechanism. The off-switch result specifically requires the robot to model the human as "approximately rational."
|
|
- Repeated play changes the incentive structure. In the one-shot off-switch game, the robot prefers shutdown. In repeated interactions, the robot may learn that humans sometimes make suboptimal shutdown decisions, reducing the information value of the shutdown signal.
|
|
- CIRL has been demonstrated in toy domains (grid worlds, simple reward functions). Scaling to real-world value complexity — where human preferences are contradictory, context-dependent, and evolving — remains unvalidated.
|
|
- The framework assumes a single human. Multi-principal CIRL (learning from many humans with conflicting preferences) is an open problem that connects to our collective intelligence architecture but has no published solution.
|
|
|
|
Relevant Notes:
|
|
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — CIRL provides the constructive response: don't specify, learn through observation
|
|
- [[corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests]] — Yudkowsky challenges the CIRL result: the off-switch game assumes idealized conditions that don't hold at scale
|
|
- [[AI alignment is a coordination problem not a technical problem]] — CIRL is a game-theoretic formalization that treats alignment as coordination between human and AI, not just optimization
|
|
|
|
Topics:
|
|
- [[_map]]
|