teleo-codex/domains/ai-alignment/learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want.md

---
type: claim
domain: ai-alignment
description: "Russell's cooperative AI framework inverts the standard alignment paradigm: instead of specifying what the AI should want and hoping it complies, build the AI to learn what humans want through observation while maintaining the uncertainty that makes it corrigible"
confidence: experimental
source: "Hadfield-Menell, Dragan, Abbeel, Russell, 'Cooperative Inverse Reinforcement Learning' (NeurIPS 2016); Russell, 'Human Compatible: AI and the Problem of Control' (Viking, 2019)"
created: 2026-04-05
related:
  - "an AI agent that is uncertain about its objectives will defer to human shutdown commands because corrigibility emerges from value uncertainty not from engineering against instrumental interests"
  - "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values"
  - "intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends"
  - "pluralistic AI alignment through multiple systems preserves value diversity better than forced consensus"
---

# Learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want

Russell (2019) identifies the "standard model" of AI as the root cause of alignment risk: build a system, give it a fixed objective, let it optimize. This model produces systems that resist shutdown (being turned off prevents goal achievement), pursue resource acquisition (more resources enable more optimization), and generate unintended side effects (any consequence not explicitly penalized in the objective function is irrelevant to the system). The alignment problem under the standard model is how to specify the objective correctly — and Russell argues this is the wrong question.

The alternative: don't hand the AI a fixed, fully specified objective. Build it as a cooperative partner that learns human values through observation. This is formalized as Cooperative Inverse Reinforcement Learning (CIRL, Hadfield-Menell et al., NeurIPS 2016): a two-player cooperative game in which both players are rewarded according to the human's reward function, but only the human knows that function and the robot must infer it from the human's behavior. Unlike standard IRL (which treats the human as a fixed part of the environment), CIRL models the human as an active participant who can teach, demonstrate, and correct.
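The inference step can be illustrated with a minimal sketch. This is plain Bayesian reward inference from observed choices, not the full CIRL game (it omits the human choosing actions in order to teach). The candidate reward functions, the action names, and the Boltzmann-rational choice model with parameter `beta` are all assumptions made for illustration; none of them come from the paper.

```python
import math

# Hypothetical candidate reward functions over three actions (illustrative only).
CANDIDATE_REWARDS = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0, "water": 0.2},
    "likes_tea": {"coffee": 0.0, "tea": 1.0, "water": 0.2},
    "indifferent": {"coffee": 0.4, "tea": 0.4, "water": 0.4},
}


def choice_likelihood(action, reward, beta=3.0):
    """P(action | reward) under an assumed Boltzmann-rational human."""
    weights = {a: math.exp(beta * r) for a, r in reward.items()}
    return weights[action] / sum(weights.values())


def update_belief(belief, action, beta=3.0):
    """One Bayesian update of the robot's belief after observing a human action."""
    unnormalized = {
        name: prior * choice_likelihood(action, CANDIDATE_REWARDS[name], beta)
        for name, prior in belief.items()
    }
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


# Start from a uniform prior: the robot does not know which reward is the real one.
belief = {name: 1.0 / len(CANDIDATE_REWARDS) for name in CANDIDATE_REWARDS}

# Hypothetical observations of the human's choices.
for observed in ["tea", "tea", "water", "tea"]:
    belief = update_belief(belief, observed)

print({name: round(p, 3) for name, p in belief.items()})
```

After these observations the belief shifts toward `likes_tea` but every hypothesis keeps nonzero probability, which is the property the argument relies on: the robot acts on a distribution over rewards rather than on a single fixed objective.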

The structural safety advantage is that the agent never has a fixed objective to optimize against humans. It maintains genuine uncertainty about what humans want, and this uncertainty makes it cooperative by default. The three principles of beneficial AI make this explicit: (1) the machine's only objective is to maximize human preference realization, (2) it is initially uncertain about those preferences, (3) human behavior is the information source. Together these produce an agent that is incentivized to ask for clarification, accept correction, and defer to human judgment — not because it's been constrained to do so, but because these are instrumentally rational strategies given its uncertainty.
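The claim that deference is instrumentally rational under uncertainty can be checked with a toy calculation. The sketch below assumes a robot considering one proposed action whose utility to the human, `U`, is unknown to the robot; the Gaussian belief, its parameters, and the assumption that the human knows `U` and permits the action exactly when `U > 0` are illustrative choices, not details taken from the sources.

```python
import random


def option_values(utility_samples):
    """Expected value of each robot option, given samples from its belief over U.

    Assumption: the human knows U and permits the action exactly when U > 0.
    """
    n = len(utility_samples)
    act = sum(utility_samples) / n  # act without asking
    off = 0.0  # switch itself off
    defer = sum(max(u, 0.0) for u in utility_samples) / n  # let the human decide
    return {"act": act, "off": off, "defer": defer}


rng = random.Random(0)
# Hypothetical belief: the action is probably good (mean 0.5) but could be harmful.
uncertain_belief = [rng.gauss(0.5, 2.0) for _ in range(10_000)]

values = option_values(uncertain_belief)
best = max(values, key=values.get)
print(values, "->", best)
```

Because `max(u, 0)` is at least as large as both `u` and `0` for every sample, deferring can never do worse than acting unilaterally or shutting down in this model. The result depends entirely on the assumed rational human; the Challenges section discusses what happens when that assumption fails.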

This bears on the problem identified by [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]. Russell's framework does not freeze a single learned reward function: the agent holds a distribution over possible rewards and keeps refining it through observation. In principle this accommodates context-dependent preferences, because behavior observed in different contexts can support a richer preference model than any fixed reward function. The basic CIRL formulation still models a single human, however, so diversity across people remains an open problem (see Challenges).

The relationship to the orthogonality thesis is nuanced. [[intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends]] — Russell accepts orthogonality but argues it strengthens rather than weakens his case. Precisely because intelligence doesn't converge on good values, we must build the uncertainty about values into the architecture rather than hoping the right values emerge from capability scaling.

## Challenges

- Inverse reinforcement learning from human behavior inherits all the biases, irrationalities, and inconsistencies of human behavior. Humans are poor exemplars of their own values — we act against our stated preferences regularly. An IRL agent may learn revealed preferences (what humans do) rather than reflective preferences (what humans would want upon reflection).
- The multi-principal problem is severe. Whose behavior does the agent learn from? Different humans have genuinely incompatible preferences. Aggregating observed behavior across a diverse population may produce incoherent or averaged-out preference models. [[pluralistic AI alignment through multiple systems preserves value diversity better than forced consensus]] suggests that multiple agents with different learned preferences may be structurally better than one agent attempting to learn everyone's preferences.
- Current deployed systems (RLHF, constitutional AI) don't implement Russell's framework. They optimize against reward models or written principles that are fixed once training ends, derived from human or AI feedback, rather than performing ongoing cooperative preference learning. The gap between theory and practice remains large.
- At superhuman capability levels, the agent may resolve its uncertainty about human values. Once it is confident it knows what humans want, the incentive to defer that came from value uncertainty disappears. This is a capability-dependent ceiling on the approach, of a kind that arguably limits current alignment approaches more generally.
- Russell's framework assumes humans can be modeled as approximately rational agents whose behavior is informative about their values. In adversarial settings, strategic settings, or settings with systematic cognitive biases, this assumption fails.
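
The last two challenges can be made concrete with a toy model. The sketch below assumes the robot's belief about the utility `U` of a proposed action is Gaussian, and that an overseeing human permits the action when `U > 0` but makes the wrong call with probability `human_error_rate`, independently of `U`. These modeling choices and all parameter values are assumptions for illustration. The function reports how much the robot expects to gain by deferring rather than taking its best unilateral option.

```python
import math


def expected_positive_part(mu, sigma):
    """E[max(U, 0)] for U ~ Normal(mu, sigma^2), in closed form."""
    if sigma == 0:
        return max(mu, 0.0)
    z = mu / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return mu * cdf + sigma * pdf


def deference_gain(mu, sigma, human_error_rate=0.0):
    """Expected value of deferring minus the best unilateral option (act or off)."""
    pos = expected_positive_part(mu, sigma)  # E[max(U, 0)]
    neg = mu - pos  # E[min(U, 0)]
    # A correct human yields max(U, 0); a mistaken one yields min(U, 0).
    defer = (1.0 - human_error_rate) * pos + human_error_rate * neg
    return defer - max(mu, 0.0)


# Challenge: as the robot's uncertainty (sigma) shrinks, so does the gain from deferring.
gains = [deference_gain(0.5, sigma) for sigma in (2.0, 0.5, 0.1)]

# Challenge: with an unreliable human, a fairly confident robot expects to lose by deferring.
noisy = deference_gain(0.5, 0.5, human_error_rate=0.3)

print(gains, noisy)
```

In this model the gain from deferring falls toward zero as the robot becomes confident, and turns negative once the human is unreliable enough relative to the robot's remaining uncertainty. Both effects follow from the stated assumptions and are not a general result.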