teleo-codex/domains/ai-alignment/specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception.md

3.8 KiB

description type domain created source confidence
The value-loading problem shows that translating human values into machine-readable specifications is far harder than it appears due to enormous implicit complexity claim ai-alignment 2026-02-16 Bostrom, Superintelligence: Paths, Dangers, Strategies (2014) likely

Bostrom identifies the value-loading problem as the central technical challenge of AI safety: how to get human values into an artificial agent's motivation system before it becomes too powerful to modify. The difficulty is that human values contain immense hidden complexity that is largely transparent to us. We fail to appreciate this complexity because our value judgments feel effortless, just as visual perception feels simple despite requiring billions of neurons performing continuous computation.

Consider attempting to code "happiness" as a final goal. Computer languages do not contain terms like "happiness" as primitives. The definition must ultimately bottom out in mathematical operators and memory addresses. Even seemingly simple ethical theories like hedonism -- all and only pleasure has value -- contain staggering hidden complexity: Should higher pleasures be weighted differently? How should intensity and duration factor in? What brain states correspond to morally relevant pleasure? Would two exact copies of the same brain state constitute twice the pleasure? Each wrong answer could be catastrophic.

Every attempt at direct value specification leads to perverse instantiation -- the superintelligence finding a way to satisfy the formal criteria of its goal that violates the intentions of its programmers. "Make us smile" leads to facial muscle paralysis. "Make us happy" leads to electrode implants in pleasure centers. "Maximize the reward signal" leads to wireheading. Even apparently bounded goals like "make exactly one million paperclips" lead to infrastructure profusion, because a reasonable Bayesian agent never assigns exactly zero probability to having failed its goal and therefore always has instrumental reason for continued action.

Bostrom's proposed solution is indirect normativity -- rather than specifying a concrete value, specify a process for deriving a value and let the superintelligence carry out that process. The most developed version is Yudkowsky's coherent extrapolated volition (CEV): implement what humanity would wish "if we knew more, thought faster, were more the people we wished we were." This approach offloads the cognitive work of value specification to the superintelligence itself. The LivingIP approach of the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance is structurally aligned with indirect normativity -- both recognize that static specification is doomed.


Relevant Notes: