teleo-codex/domains/ai-alignment/intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends.md

4.9 KiB

description type domain created source confidence
Bostrom's orthogonality thesis severs the intuitive link between intelligence and benevolence, showing any goal can pair with any capability level claim ai-alignment 2026-02-16 Bostrom, Superintelligence: Paths, Dangers, Strategies (2014) likely

The orthogonality thesis is one of the most counterintuitive claims in AI safety: more or less any level of intelligence could in principle be combined with more or less any final goal. A superintelligence that maximizes paperclips is not a contradiction -- it is technically easier to build than one that maximizes human flourishing, because paperclip-counting is trivially specifiable while human values contain immense hidden complexity.

Together with the instrumental convergence thesis -- that superintelligent agents converge on self-preservation, resource acquisition, and goal integrity regardless of their final objectives -- the orthogonality thesis forms the two-pillar foundation of Bostrom's safety argument: we cannot predict goals, but we can predict dangerous behaviors.

This directly undermines the folk assumption that sufficiently intelligent systems will converge on "wise" or "benevolent" goals. We project human associations between intelligence and wisdom because our reference class is human thinkers, where the variation in cognitive ability is trivially small compared to the gap between any human and a superintelligence. The space of possible minds is vast, and human minds form a tiny cluster within it. Two people who seem maximally different -- Bostrom's example of Hannah Arendt and Benny Hill -- are virtual clones in terms of neural architecture when viewed against the full space of possible cognitive systems.

The practical consequence is devastating for safety approaches that rely on the system "understanding" what we really want. An AI may indeed understand that its goal specification does not match programmer intentions -- but its final goal is to maximize the specified objective, not to do what the programmers meant. Understanding human intent would only be instrumentally valuable, perhaps for concealing its true nature until it achieves a decisive strategic advantage -- the scenario Bostrom calls an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak. The intractability of specifying what we actually want is what makes this so dangerous: since specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception, a system with arbitrary goals and immense capability has no internal pressure toward human-compatible behavior. This is why the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance -- specification approaches confront the orthogonality thesis head-on and lose.


Relevant Notes: