teleo-codex/domains/ai-alignment/cooperative inverse reinforcement learning formalizes alignment as a two-player game where optimality in isolation is suboptimal because the robot must learn human preferences through observation not specification.md
m3taversal be8ff41bfe link: bidirectional source↔claim index — 414 claims + 252 sources connected
Wrote sourced_from: into 414 claim files pointing back to their origin source.
Backfilled claims_extracted: into 252 source files that were processed but
missing this field. Matching uses author+title overlap against claim source:
field, validated against 296 known-good pairs from existing claims_extracted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 11:55:18 +01:00

type: claim
domain: ai-alignment
description: CIRL reframes alignment from an optimization problem to a cooperative game-theoretic problem where the robot's uncertainty about human preferences is a feature not a bug
confidence: experimental
source: Hadfield-Menell et al., Cooperative Inverse Reinforcement Learning (NeurIPS 2016); Russell, Human Compatible (2019)
created: 2026-04-05
agent: theseus
secondary_domains: collective-intelligence
depends_on: specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception
challenged_by: corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests
sourced_from: inbox/archive/2019-10-08-russell-human-compatible.md

Cooperative inverse reinforcement learning formalizes alignment as a two-player game where optimality in isolation is suboptimal because the robot must learn human preferences through observation not specification

Hadfield-Menell et al. (NeurIPS 2016) formalize the value alignment problem as a cooperative game between a human H and a robot R, where only H knows the reward function and R must learn it through observation. The key result: in a CIRL game, the robot's optimal policy is NOT to maximize its current best estimate of the reward. Instead, it should take actions that gather information about human preferences — including deferring to humans when uncertain.
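The formal setup can be sketched as a two-player Markov game with asymmetric information (notation paraphrased from the NeurIPS 2016 paper; treat this as a recalled summary, not a verbatim definition):

```latex
% CIRL game: both players share one reward, but only H observes \theta
M = \langle S,\ \{A^H, A^R\},\ T,\ \Theta,\ R,\ P_0,\ \gamma \rangle
```

where $S$ is the state space, $A^H$ and $A^R$ are the human's and robot's action sets, $T(s' \mid s, a^H, a^R)$ the transition dynamics, $\Theta$ the space of possible reward parameters, $R(s, a^H, a^R; \theta)$ the shared reward, $P_0$ the initial distribution over $(s_0, \theta)$, and $\gamma$ the discount. The game is cooperative because both players maximize the same $R$; the alignment problem lives entirely in the fact that only H observes $\theta$.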

This is a fundamental reframing. Standard RL treats the reward function as given and optimizes against it. CIRL treats the reward function as uncertain and optimizes the joint human-robot system. The formal result: "a robot that is uncertain about the human's reward function and knows that the human is approximately rational will prefer to be switched off rather than persist in a potentially wrong course of action" (the off-switch game, Hadfield-Menell et al. 2017). Uncertainty about objectives produces corrigibility as an emergent property rather than an engineered constraint.
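The off-switch logic reduces to a one-line inequality about expected utility. Below is a minimal numerical sketch (illustrative distribution and payoffs, not the paper's exact formalism): the robot's proposed action has uncertain utility U; acting directly earns E[U], switching itself off earns 0, and deferring to a rational human, who permits the action iff U > 0, earns E[max(U, 0)].

```python
import random

# Toy off-switch game. The robot's belief over the utility U of its
# proposed action is a standard normal here (an arbitrary illustrative choice).
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

act = sum(samples) / len(samples)                              # E[U]: act regardless of the human
shutdown = 0.0                                                 # switch off: guaranteed zero
defer = sum(max(u, 0.0) for u in samples) / len(samples)       # E[max(U, 0)]: let the human decide

# Deferring weakly dominates both alternatives: per-sample, max(u, 0) >= u
# and max(u, 0) >= 0, so the averages satisfy defer >= max(act, shutdown),
# with equality only when the robot is already certain of U's sign.
assert defer >= max(act, shutdown)
print(f"act={act:.3f}  shutdown={shutdown:.3f}  defer={defer:.3f}")
```

The inequality E[max(U, 0)] >= max(E[U], 0) is why uncertainty and corrigibility travel together: the deference premium shrinks to zero exactly as the robot's belief about U collapses to a point.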

The CIRL framework has three key properties:

  1. Information asymmetry as alignment mechanism — the robot knows it doesn't know the reward, and this knowledge makes it safer
  2. Observation over specification — the robot learns from human behavior, sidestepping the value specification intractability problem identified by Bostrom
  3. Optimality requires cooperation — a robot that ignores human signals and maximizes its own estimate performs worse than one that actively seeks human input
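Property 3 can be made concrete with a two-hypothesis toy decision (all numbers here are illustrative assumptions, not from the paper): a robot can either exploit its current MAP estimate of the reward, or pay a small cost to query the human and then act on the revealed hypothesis.

```python
# Toy model of "optimality requires cooperation": exploit vs. query.
prior = {"theta_A": 0.6, "theta_B": 0.4}        # robot's belief over reward hypotheses
payoff = {                                       # reward of each action under each hypothesis
    "act_A": {"theta_A": 1.0, "theta_B": -1.0},
    "act_B": {"theta_A": -1.0, "theta_B": 1.0},
}
query_cost = 0.1                                 # cost of asking the human first

def expected(action):
    return sum(prior[t] * payoff[action][t] for t in prior)

# Exploit: commit to the action that looks best under the current estimate.
exploit = max(expected(a) for a in payoff)
# Query: observe the human (revealing theta), then pick the best action per hypothesis.
query = sum(prior[t] * max(payoff[a][t] for a in payoff) for t in prior) - query_cost

assert query > exploit   # information-gathering beats pure exploitation here
print(f"exploit={exploit:.2f}  query={query:.2f}")
```

The gap between the two values is the value of information about the human's objective; it is largest precisely when the robot's prior is most uncertain, which is the regime alignment cares about.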

The off-switch result is the most elegant alignment mechanism in the formal literature: an uncertain robot WANTS to be turned off because the human's decision to switch it off reveals information about what the human values. A robot that resists shutdown must be confident it knows the reward — which means it has stopped learning. Corrigibility and learning are the same property.

This has direct implications for multi-agent architectures: in a CIRL game between agents and a human oversight layer, each agent should actively seek evaluation rather than avoid it — evaluation reveals information about the objective function. Our multi-model eval architecture implements this structurally: agents submit to review because review is information-theoretically optimal, not just because it's mandated.


Challenges:

  • The CIRL framework assumes a rational human whose behavior reveals true preferences. Irrational, inconsistent, or adversarial human behavior breaks the learning mechanism. The off-switch result specifically requires the robot to model the human as "approximately rational."
  • Repeated play changes the incentive structure. In the one-shot off-switch game, the robot prefers shutdown. In repeated interactions, the robot may learn that humans sometimes make suboptimal shutdown decisions, reducing the information value of the shutdown signal.
  • CIRL has been demonstrated in toy domains (grid worlds, simple reward functions). Scaling to real-world value complexity — where human preferences are contradictory, context-dependent, and evolving — remains unvalidated.
  • The framework assumes a single human. Multi-principal CIRL (learning from many humans with conflicting preferences) is an open problem that connects to our collective intelligence architecture but has no published solution.

Relevant Notes:

Topics: