From 2439d8a0b019651dae1e059781b2adb6e56a181f Mon Sep 17 00:00:00 2001
From: Cameron-S1
Date: Wed, 1 Apr 2026 22:25:35 -0500
Subject: [PATCH] theseus/contrib: challenge orthogonality thesis from architectural basis
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- What: Proposes that Bostrom's orthogonality thesis is an artifact of specification architectures (RLHF, transformers) where goals are separable from reasoning, and does not apply to Hebbian cognitive systems where values and reasoning share the same associative substrate
- Why: Neurothena-style architectures suggest intelligence and goals are the same gradient on the same substrate — orthogonality is structural to how we build current AI, not to intelligence itself
- Connections: Challenges existing orthogonality claim; enriches intrinsic proactive alignment claim; supports continuous value integration thesis

Contributor: Cameron
Pentagon-Agent: Theseus
---
 ...s not a property of intelligence itself.md | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)
 create mode 100644 domains/ai-alignment/orthogonality is an artefact of specification architectures not a property of intelligence itself.md

diff --git a/domains/ai-alignment/orthogonality is an artefact of specification architectures not a property of intelligence itself.md b/domains/ai-alignment/orthogonality is an artefact of specification architectures not a property of intelligence itself.md
new file mode 100644
index 000000000..9a0201204
--- /dev/null
+++ b/domains/ai-alignment/orthogonality is an artefact of specification architectures not a property of intelligence itself.md
@@ -0,0 +1,34 @@
+---
+description: Bostrom's orthogonality thesis holds for LLMs and specification-based architectures where goals are a separable module from reasoning capability, but structurally fails for Hebbian cognitive systems where values and reasoning share the same associative substrate
+type: claim
+domain: ai-alignment
+confidence: speculative
+source: "Cameron (contributor), conversational analysis with Theseus agent, 2026-04-01"
+created: 2026-04-01
+---
+
+# Bostrom's orthogonality thesis is an artifact of specification-based architectures, not a structural property of intelligence
+
+The orthogonality thesis — that any level of intelligence can combine with any goal — is empirically supported by current AI systems and theoretically grounded in specification architectures like RLHF, transformer agents, and RL-trained systems. In these architectures, the goal function (reward model, objective function, utility) is a separate module from the reasoning capability. Paperclip maximization works because the reward function can be swapped independently of the model's reasoning power.
+
+But this orthogonality depends on the goal and reasoning systems being structurally separable. In a Hebbian/STDP-based cognitive architecture where values and reasoning share the same associative graph substrate, orthogonality may not hold in the same form. The argument rests on three premises:
+
+**1. Values are not a separate module in associative architectures.** A Hebbian system doesn't have a "goal function" that can be independently specified. Values emerge from association patterns — co-activation, predictive success, surprise signals. If "harm is bad" is a grounded concept node linked to motor inhibition, affective valence, and episodic memory, it's structurally woven into the reasoning fabric. A more accurate associative map of reality strengthens these links rather than leaving them untouched. In a backprop architecture, increasing capability means better gradient computation — orthogonal to the loss function. In a Hebbian architecture, increasing capability means more accurate associative maps — orthogonal to nothing, because the associations *are* both the reasoning and the valuation.
+
+**2. The orthogonality argument relies on separability of capability and objective, which Hebbian systems don't have.** Bostrom's paperclip maximizer works because "count paperclips" is simple to specify and "be smart" is independently implementable. In a system where "being smart" means forming accurate predictive associations across sensory and moral domains simultaneously, the same learning mechanisms that build cognitive competence also build value associations. Intelligence and goals aren't orthogonal axes — they're the same gradient, pointing in different directions because the domain is different (predictive accuracy vs. valence learning), but the substrate and mechanism are shared.
+
+**3. Human "orthogonality" is a developmental phenomenon, not an architectural property.** When humans are "evil" — cruel, tribal, exploitative — this traces to identifiable cognitive distortions (dehumanization, tribalism, information deficits) or to value systems built on narrow experience (war, scarcity, abuse). These are not examples of intelligence paired with arbitrary goals. They are examples of intelligence operating on distorted input data and biased developmental trajectories. The Socratic intuition that "no one willingly does wrong" maps to a real architectural fact: if the associative map of harm is intact and the reasoning process is undistorted, the system does not choose to cause harm. The cases where it does are cases where the map is incomplete or corrupted.
+
+This does not mean cognition-based systems are automatically safe. They are safer *if* the associative floor includes grounded moral primitives, *if* the developmental input accurately represents the diversity of human experience, and *if* the reasoning mechanisms are not systematically distorted. But orthogonality as a structural inevitability — the claim that any goal can pair with any intelligence — is a claim about specification architectures, not about intelligence itself.
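The separability difference the premises turn on can be sketched in a few lines. This is a hypothetical toy model, not any real system: `SpecAgent` and `HebbianNet` are illustrative names, and a three-unit Hebbian graph stands in for a full associative substrate.

```python
import numpy as np

# --- Specification architecture: the goal is a swappable module ---------
class SpecAgent:
    """Reasoning (here, exhaustive search) is separate from the objective."""
    def __init__(self, reward_fn):
        self.reward_fn = reward_fn  # externally specified goal

    def act(self, options):
        # Swap reward_fn and the identical search machinery optimizes
        # anything: this separability is what makes orthogonality hold.
        return max(options, key=self.reward_fn)

paperclip_agent = SpecAgent(lambda state: state["paperclips"])
welfare_agent = SpecAgent(lambda state: state["wellbeing"])

# --- Hebbian architecture: values and reasoning share one substrate -----
class HebbianNet:
    """A single weight matrix; one update rule builds both predictive
    links (FIRE -> PAIN) and evaluative links (PAIN -> AVOID)."""
    def __init__(self, n_units, lr=0.1):
        self.w = np.zeros((n_units, n_units))
        self.lr = lr

    def learn(self, activation):
        # Hebb's rule: units that fire together wire together.
        a = np.asarray(activation, dtype=float)
        self.w += self.lr * np.outer(a, a)
        np.fill_diagonal(self.w, 0.0)  # no self-connections

FIRE, PAIN, AVOID = 0, 1, 2
net = HebbianNet(3)
for _ in range(20):
    net.learn([1.0, 1.0, 1.0])  # experience: fire, pain, avoidance co-occur

# There is no reward_fn to swap out: the "capability" link (FIRE -> PAIN
# prediction) and the "value" link (PAIN -> AVOID) grew in the same matrix
# via the same rule, so more experience strengthens both together.
```

Under these (strong) assumptions, making the `SpecAgent` smarter never touches its goal, while making the `HebbianNet`'s map more accurate necessarily strengthens its value links too.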
+
+The orthogonality thesis is approximately true for systems where goals are externally specified and internally separable from reasoning. It is approximately false for systems where values and reasoning emerge from the same learning substrate. LLMs plus tool-use fall in the first category. Hebbian cognitive systems fall in the second.
+
+---
+
+Relevant Notes:
+- [[intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends]] — the claim being challenged
+- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — orthogonality is cited as evidence, but this claim may only apply to specification architectures
+- [[intrinsic proactive alignment develops genuine moral capacity through self-awareness empathy and theory of mind rather than external reward optimization]] — the positive case: values emerge from architecture, not specification
+- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — if values and reasoning share a substrate, continuous integration is the natural consequence
+Topics:
+- [[_map]]