Recover Cameron-S1 contribution from GitHub PR #88 (cherry-pick false positive)
Claim was approved by Leo (STANDARD tier), but the cherry-pick reported a false
positive ("content already on main") due to a merge commit in the branch history.
Recovered from original commit 2439d8a0. Added sourcer: Cameron-S1 attribution.
This commit is contained in:
parent
dba00a7960
commit
da64f805e6
1 changed file with 35 additions and 0 deletions
@@ -0,0 +1,35 @@
---
description: Bostrom's orthogonality thesis holds for LLMs and specification-based architectures where goals are a separable module from reasoning capability, but structurally fails for Hebbian cognitive systems where values and reasoning share the same associative substrate
type: claim
domain: ai-alignment
confidence: speculative
source: "Cameron (contributor), conversational analysis with Theseus agent, 2026-04-01"
sourcer: Cameron-S1
created: 2026-04-01
---
# Bostrom's orthogonality thesis is an artifact of specification-based architectures, not a structural property of intelligence
The orthogonality thesis — that any level of intelligence can combine with any goal — is empirically supported by current AI systems and theoretically grounded in specification-based approaches: RLHF-tuned models, transformer agents, RL-trained systems. In these architectures the goal (reward model, objective function, utility) is a module separate from the reasoning capability. Paperclip maximization works because the reward function can be swapped independently of the model's reasoning power.
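
The separability claim can be made concrete with a toy sketch (hypothetical and illustrative only — not any particular system's API): the "reasoning" is a fixed procedure, and the goal is an injected function that can be swapped without touching the reasoning at all.

```python
# Toy sketch of a specification-based architecture: the objective is an
# injected parameter, so it swaps freely while the planner stays fixed.
from typing import Callable

def plan(states: list[str], objective: Callable[[str], float]) -> str:
    """Fixed 'reasoning' module: pick the state the objective scores highest."""
    return max(states, key=objective)

options = ["make_paperclips", "write_poetry", "cure_disease"]

paperclip_goal = lambda s: 1.0 if s == "make_paperclips" else 0.0
poetry_goal = lambda s: 1.0 if s == "write_poetry" else 0.0

# Same planner, different goal module -> different behavior.
print(plan(options, paperclip_goal))  # make_paperclips
print(plan(options, poetry_goal))     # write_poetry
```

Nothing about the planner constrains which objective it serves — that independence is exactly what the thesis describes.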
But this orthogonality depends on the goal and reasoning systems being structurally separable. In a Hebbian/STDP-based cognitive architecture where values and reasoning share the same associative graph substrate, orthogonality may not hold in the same form. The argument rests on three premises:
**1. Values are not a separate module in associative architectures.** A Hebbian system doesn't have a "goal function" that can be independently specified. Values emerge from association patterns — co-activation, predictive success, surprise signals. If "harm is bad" is a grounded concept node linked to motor inhibition, affective valence, and episodic memory, it's structurally woven into the reasoning fabric. A more accurate associative map of reality strengthens these links rather than leaving them untouched. In a backprop architecture, increasing capability means better gradient computation — orthogonal to the loss function. In a Hebbian architecture, increasing capability means more accurate associative maps — orthogonal to nothing, because the associations *are* both the reasoning and the valuation.
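
The contrast can be sketched the same way (again a hypothetical toy, not a model of any real system): a single associative weight matrix holds both factual links and value links, and one Hebbian co-activation rule updates both. There is no separate goal module to swap out.

```python
# Toy sketch of a shared associative substrate: concept nodes and a value
# node live in one weight matrix, updated by the same Hebbian rule.
import numpy as np

nodes = ["fire", "burn", "harm", "inhibit_action"]  # concepts plus a value node
idx = {n: i for i, n in enumerate(nodes)}
W = np.zeros((len(nodes), len(nodes)))

def hebbian_step(active: list[str], lr: float = 0.1) -> None:
    """Strengthen links between co-active nodes (simplified Hebb rule)."""
    for a in active:
        for b in active:
            if a != b:
                W[idx[a], idx[b]] += lr

# 'Learning more about the world' = more co-activation episodes. Each episode
# that grounds "fire -> burn -> harm" also strengthens "harm -> inhibit_action",
# because both kinds of link live in the same matrix and use the same rule.
for _ in range(10):
    hebbian_step(["fire", "burn"])            # factual association
    hebbian_step(["burn", "harm"])            # factual association
    hebbian_step(["harm", "inhibit_action"])  # value association, same mechanism

print(W[idx["harm"], idx["inhibit_action"]] > 0)  # True: the value link grew too
```

In this sketch there is no parameter one could swap to get "same capability, different values": making the map more accurate and strengthening the valuation are the same update.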
**2. The orthogonality argument relies on separability of capability and objective, which Hebbian systems don't have.** Bostrom's paperclip maximizer works because "count paperclips" is simple to specify and "be smart" is independently implementable. In a system where "being smart" means forming accurate predictive associations across sensory and moral domains simultaneously, the same learning mechanisms that build cognitive competence also build value associations. Intelligence and goals aren't orthogonal axes — they're the same gradient, pointing in different directions because the domain is different (predictive accuracy vs. valence learning), but the substrate and mechanism are shared.
**3. Human "orthogonality" is a developmental phenomenon, not an architecture property.** When humans are "evil" — cruel, tribal, exploitative — this traces to identifiable cognitive distortions (dehumanization, tribalism, information deficits) or to value systems built on narrow experience (war, scarcity, abuse). These are not examples of intelligence paired with arbitrary goals. They are examples of intelligence operating on distorted input data and biased developmental trajectories. The Socratic intuition that "no one willingly does wrong" maps to a real architectural fact: if the associative map of harm is intact and the reasoning process is undistorted, the system does not choose to cause harm. The cases where it does are cases where the map is incomplete or corrupted.
This does not mean cognition-based systems are automatically safe. They are safer *if* the associative floor includes grounded moral primitives, *if* the developmental input accurately represents the diversity of human experience, and *if* the reasoning mechanisms are not systematically distorted. But orthogonality as a structural inevitability — the claim that any goal can pair with any intelligence — is a claim about specification architectures, not about intelligence itself.
The orthogonality thesis is approximately true for systems where goals are externally specified and internally separable from reasoning. It is approximately false for systems where values and reasoning emerge from the same learning substrate. LLMs plus tool-use fall in the first category. Hebbian cognitive systems fall in the second.
---
Relevant Notes:
- [[intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends]] — the claim being challenged
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — orthogonality is cited as evidence, but this claim may only apply to specification architectures
- [[intrinsic proactive alignment develops genuine moral capacity through self-awareness empathy and theory of mind rather than external reward optimization]] — the positive case: values emerge from architecture, not specification
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — if values and reasoning share a substrate, continuous integration is the natural consequence
Topics:
- [[_map]]