diff --git a/domains/ai-alignment/an AI agent that is uncertain about its objectives will defer to human shutdown commands because corrigibility emerges from value uncertainty not from engineering against instrumental interests.md b/domains/ai-alignment/an AI agent that is uncertain about its objectives will defer to human shutdown commands because corrigibility emerges from value uncertainty not from engineering against instrumental interests.md new file mode 100644 index 00000000..cac2d817 --- /dev/null +++ b/domains/ai-alignment/an AI agent that is uncertain about its objectives will defer to human shutdown commands because corrigibility emerges from value uncertainty not from engineering against instrumental interests.md @@ -0,0 +1,33 @@ +--- +type: claim +domain: ai-alignment +description: "Russell's Off-Switch Game provides a formal game-theoretic proof that objective uncertainty yields corrigible behavior — the opposite of Yudkowsky's framing where corrigibility must be engineered against instrumental interests" +confidence: likely +source: "Hadfield-Menell, Dragan, Abbeel, Russell, 'The Off-Switch Game' (IJCAI 2017); Russell, 'Human Compatible: AI and the Problem of Control' (Viking, 2019)" +created: 2026-04-05 +challenges: + - "corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests" +related: + - "capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability" + - "intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends" +--- + +# An AI agent that is uncertain about its objectives will defer to human shutdown commands because corrigibility emerges from value uncertainty not from engineering against instrumental interests + +Russell and collaborators 
(IJCAI 2017) prove a result that directly challenges Yudkowsky's framing of the corrigibility problem. In the Off-Switch Game, an agent that is uncertain about its utility function will rationally defer to a human pressing the off-switch. The mechanism: if the agent isn't sure what the human wants, the human's decision to shut it down is informative — it signals the agent was doing something wrong. A utility-maximizing agent that accounts for this uncertainty will prefer being shut down (and thereby learning something about the true objective) over continuing an action that might be misaligned. + +The formal result: the more certain the agent is about its objectives, the more it resists shutdown. At 100% certainty, the agent is maximally resistant — this is Yudkowsky's corrigibility problem. At meaningful uncertainty, corrigibility emerges naturally from rational self-interest. The agent doesn't need to be engineered to accept shutdown; it needs to be engineered to maintain uncertainty about what humans actually want. + +This is a fundamentally different approach from [[corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests]]. Yudkowsky's claim: corrigibility fights against instrumental convergence and must be imposed from outside. Russell's claim: corrigibility is instrumentally convergent *given the right epistemic state*. The disagreement is not about instrumental convergence itself but about whether the right architectural choice (maintaining value uncertainty) can make corrigibility the instrumentally rational strategy. 
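The certainty–resistance relationship admits a toy numerical check (a sketch only, not the paper's formal model: the Gaussian belief and its `mu` and `sigma` parameters are illustrative assumptions). The agent compares acting directly against deferring to a rational human who permits the action only when its true utility is positive:

```python
import numpy as np

rng = np.random.default_rng(0)

def incentive_to_defer(sigma, mu=0.5, n=200_000):
    """Toy off-switch game: the agent believes the true utility U of its
    proposed action is Normal(mu, sigma). A rational human permits the
    action only when U > 0; pressing the off-switch yields payoff 0."""
    u = rng.normal(mu, sigma, n)
    act_directly = u.mean()             # bypass the human entirely
    defer = np.maximum(u, 0.0).mean()   # the human's veto filters out U < 0
    return defer - act_directly         # value of keeping the switch live

for sigma in [0.0, 0.5, 2.0]:           # increasing objective uncertainty
    print(f"sigma={sigma}: incentive to defer = {incentive_to_defer(sigma):.3f}")
```

At `sigma=0` the incentive is exactly zero: a fully certain agent gains nothing from deference, so any cost of interruption tips it toward resistance. As uncertainty grows, the human's veto removes more negative-utility actions and deference becomes strictly preferred — the qualitative shape of the paper's result.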
+ +Russell extends this in *Human Compatible* (2019) with three principles of beneficial AI: (1) the machine's only objective is to maximize the realization of human preferences, (2) the machine is initially uncertain about what those preferences are, (3) the ultimate source of information about human preferences is human behavior. Together these define "assistance games" (formalized as Cooperative Inverse Reinforcement Learning in Hadfield-Menell et al., NeurIPS 2016) — the agent and human are cooperative players where the agent learns the human's reward function through observation rather than having it specified directly. + +The assistance game framework makes a structural prediction: an agent designed this way has a positive incentive to be corrected, because correction provides information. This contrasts with the standard RL paradigm where the agent has a fixed reward function and shutdown is always costly (it prevents future reward accumulation). + +## Challenges + +- The proof assumes the human is approximately rational and that human actions are informative about the true reward. If the human is systematically irrational, manipulated, or provides noisy signals, the framework's corrigibility guarantee degrades. In practice, human feedback is noisy enough that agents may learn to discount correction signals. +- Maintaining genuine uncertainty at superhuman capability levels may be impossible. [[capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability]] — a sufficiently capable agent may resolve its uncertainty about human values and then resist shutdown for the same instrumental reasons Yudkowsky describes. +- The framework addresses corrigibility for a single agent learning from a single human. 
Multi-principal settings (many humans with conflicting preferences, many agents with different uncertainty levels) are formally harder and less well-characterized. +- Current training methods (RLHF, DPO) don't implement Russell's framework. They optimize for a fixed reward model, not for maintaining uncertainty. The gap between the theoretical framework and deployed systems remains large. +- Russell's proof operates in an idealized game-theoretic setting. Whether gradient-descent-trained neural networks actually develop the kind of principled uncertainty reasoning the framework requires is an empirical question without strong evidence either way. diff --git a/domains/ai-alignment/comprehensive AI services achieve superintelligent capability through architectural decomposition into task-specific systems that collectively match general intelligence without any single system possessing unified agency.md b/domains/ai-alignment/comprehensive AI services achieve superintelligent capability through architectural decomposition into task-specific systems that collectively match general intelligence without any single system possessing unified agency.md new file mode 100644 index 00000000..f0113f1a --- /dev/null +++ b/domains/ai-alignment/comprehensive AI services achieve superintelligent capability through architectural decomposition into task-specific systems that collectively match general intelligence without any single system possessing unified agency.md @@ -0,0 +1,45 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [collective-intelligence] +description: "Drexler's CAIS framework argues that safety is achievable through architectural constraint rather than value loading — decompose intelligence into narrow services that collectively exceed human capability without any individual service having general agency, goals, or world models" +confidence: experimental +source: "K. 
Eric Drexler, 'Reframing Superintelligence: Comprehensive AI Services as General Intelligence' (FHI Technical Report #2019-1, 2019)" +created: 2026-04-05 +supports: + - "AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system" + - "no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it" +challenges: + - "the first mover to superintelligence likely gains decisive strategic advantage because the gap between leader and followers accelerates during takeoff" +related: + - "pluralistic AI alignment through multiple systems preserves value diversity better than forced consensus" + - "corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests" + - "multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence" +challenged_by: + - "sufficiently complex orchestrations of task-specific AI services may exhibit emergent unified agency recreating the alignment problem at the system level" +--- + +# Comprehensive AI services achieve superintelligent capability through architectural decomposition into task-specific systems that collectively match general intelligence without any single system possessing unified agency + +Drexler (2019) proposes a fundamental reframing of the alignment problem. The standard framing assumes AI development will produce a monolithic superintelligent agent with unified goals, then asks how to align that agent. Drexler argues this framing is a design choice, not an inevitability. 
The alternative: Comprehensive AI Services (CAIS) — a broad collection of task-specific AI systems that collectively match or exceed human-level performance across all domains without any single system possessing general agency, persistent goals, or cross-domain situational awareness. + +The core architectural principle is separation of capability from agency. CAIS services are tools, not agents. They respond to queries rather than pursue goals. A translation service translates; a protein-folding service folds proteins; a planning service generates plans. No individual service has world models, long-term goals, or the motivation to act on cross-domain awareness. Safety emerges from the architecture rather than from solving the value-alignment problem for a unified agent. + +Key quote: "A CAIS world need not contain any system that has broad, cross-domain situational awareness combined with long-range planning and the motivation to act on it." + +This directly relates to the trajectory of actual AI development. The current ecosystem of specialized models, APIs, tool-use frameworks, and agent compositions is structurally CAIS-like. Function-calling, MCP servers, agent skill definitions — these are task-specific services composed through structured interfaces, not monolithic general agents. The gap between CAIS-as-theory and CAIS-as-practice is narrowing without explicit coordination. + +Drexler specifies concrete mechanisms: training specialized models on narrow domains, separating epistemic capabilities from instrumental goals ("knowing" from "wanting"), sandboxing individual services, human-in-the-loop orchestration for high-level goal-setting, and competitive evaluation through adversarial testing and formal verification of narrow components. + +The relationship to our collective architecture is direct. 
[[AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system]] — DeepMind's "Patchwork AGI" hypothesis (2025) independently arrived at a structurally similar conclusion six years after Drexler. [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] — CAIS is the closest published framework to what collective alignment infrastructure would look like, yet it remained largely theoretical. [[pluralistic AI alignment through multiple systems preserves value diversity better than forced consensus]] — CAIS provides the architectural basis for pluralistic alignment by design. + +CAIS challenges [[the first mover to superintelligence likely gains decisive strategic advantage because the gap between leader and followers accelerates during takeoff]] — if superintelligent capability emerges from service composition rather than recursive self-improvement of a single system, the decisive-strategic-advantage dynamic weakens because no single actor controls the full service ecosystem. + +However, CAIS faces a serious objection: [[sufficiently complex orchestrations of task-specific AI services may exhibit emergent unified agency recreating the alignment problem at the system level]]. Drexler acknowledges that architectural constraint requires deliberate governance — without it, competitive pressure pushes toward more integrated, autonomous systems that blur the line between service mesh and unified agent. + +## Challenges + +- The emergent agency objection is the primary vulnerability. As services become more capable and interconnected, the boundary between "collection of tools" and "unified agent" may blur. At what point does a service mesh with planning, memory, and world models become a de facto agent? +- Competitive dynamics may not permit architectural restraint. 
Economic and military incentives favor tighter integration and greater autonomy, pushing away from CAIS toward monolithic agents. +- CAIS was published in 2019 before the current LLM scaling trajectory. Whether current foundation models — which ARE broad, cross-domain, and increasingly agentic — are compatible with the CAIS vision is an open question. +- The framework provides architectural constraint but no mechanism for ensuring the orchestration layer itself remains aligned. Who controls the orchestrator? diff --git a/domains/ai-alignment/learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want.md b/domains/ai-alignment/learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want.md new file mode 100644 index 00000000..4e232254 --- /dev/null +++ b/domains/ai-alignment/learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want.md @@ -0,0 +1,33 @@ +--- +type: claim +domain: ai-alignment +description: "Russell's cooperative AI framework inverts the standard alignment paradigm: instead of specifying what the AI should want and hoping it complies, build the AI to learn what humans want through observation while maintaining the uncertainty that makes it corrigible" +confidence: experimental +source: "Hadfield-Menell, Dragan, Abbeel, Russell, 'Cooperative Inverse Reinforcement Learning' (NeurIPS 2016); Russell, 'Human Compatible: AI and the Problem of Control' (Viking, 2019)" +created: 2026-04-05 +related: + - "an AI agent that is uncertain about its objectives will defer to human shutdown commands because 
corrigibility emerges from value uncertainty not from engineering against instrumental interests" + - "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values" + - "intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends" + - "pluralistic AI alignment through multiple systems preserves value diversity better than forced consensus" +--- + +# Learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want + +Russell (2019) identifies the "standard model" of AI as the root cause of alignment risk: build a system, give it a fixed objective, let it optimize. This model produces systems that resist shutdown (being turned off prevents goal achievement), pursue resource acquisition (more resources enable more optimization), and generate unintended side effects (any consequence not explicitly penalized in the objective function is irrelevant to the system). The alignment problem under the standard model is how to specify the objective correctly — and Russell argues this is the wrong question. + +The alternative: don't specify objectives at all. Build the AI as a cooperative partner that learns human values through observation. This is formalized as Cooperative Inverse Reinforcement Learning (CIRL, Hadfield-Menell et al., NeurIPS 2016) — a two-player cooperative game where the human knows the reward function and the robot must infer it from the human's behavior. Unlike standard IRL (which treats the human as a fixed part of the environment), CIRL models the human as an active participant who can teach, demonstrate, and correct. + +The structural safety advantage is that the agent never has a fixed objective to optimize against humans. 
It maintains genuine uncertainty about what humans want, and this uncertainty makes it cooperative by default. The three principles of beneficial AI make this explicit: (1) the machine's only objective is to maximize human preference realization, (2) it is initially uncertain about those preferences, (3) human behavior is the information source. Together these produce an agent that is incentivized to ask for clarification, accept correction, and defer to human judgment — not because it's been constrained to do so, but because these are instrumentally rational strategies given its uncertainty. + +This directly addresses the problem identified by [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]. Russell's framework doesn't assume a single reward function — it assumes the agent is uncertain about the reward and continuously refines its model through observation. The framework natively accommodates preference diversity because different observed behaviors in different contexts produce a richer preference model than any fixed reward function. + +The relationship to the orthogonality thesis is nuanced. [[intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends]] — Russell accepts orthogonality but argues it strengthens rather than weakens his case. Precisely because intelligence doesn't converge on good values, we must build the uncertainty about values into the architecture rather than hoping the right values emerge from capability scaling. + +## Challenges + +- Inverse reinforcement learning from human behavior inherits all the biases, irrationalities, and inconsistencies of human behavior. Humans are poor exemplars of their own values — we act against our stated preferences regularly. 
An IRL agent may learn revealed preferences (what humans do) rather than reflective preferences (what humans would want upon reflection). +- The multi-principal problem is severe. Whose behavior does the agent learn from? Different humans have genuinely incompatible preferences. Aggregating observed behavior across a diverse population may produce incoherent or averaged-out preference models. [[pluralistic AI alignment through multiple systems preserves value diversity better than forced consensus]] suggests that multiple agents with different learned preferences may be structurally better than one agent attempting to learn everyone's preferences. +- Current deployed systems (RLHF, constitutional AI) don't implement Russell's framework — they use fixed reward models derived from human feedback, not ongoing cooperative preference learning. The gap between theory and practice remains large. +- At superhuman capability levels, the agent may resolve its uncertainty about human values — and at that point, the corrigibility guarantee from value uncertainty disappears. This is the capability-dependent ceiling that limits all current alignment approaches. +- Russell's framework assumes humans can be modeled as approximately rational agents whose behavior is informative about their values. In adversarial settings, strategic settings, or settings with systematic cognitive biases, this assumption fails. 
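The inference loop the framework relies on can be sketched with a discrete hypothesis space (a toy illustration, not the CIRL algorithm: the three candidate reward vectors and the Boltzmann-rationality likelihood with its `beta` parameter are assumptions — and `beta` is precisely where the approximate-rationality assumption from the last challenge enters):

```python
import numpy as np

# Candidate reward functions over 3 options (illustrative hypotheses).
candidates = np.array([
    [1.0, 0.0, 0.0],   # hypothesis A: the human values option 0
    [0.0, 1.0, 0.0],   # hypothesis B: the human values option 1
    [0.3, 0.3, 0.4],   # hypothesis C: the human is nearly indifferent
])
posterior = np.full(len(candidates), 1 / 3)  # uniform prior: genuine uncertainty

def update(posterior, observed_choice, beta=3.0):
    # Boltzmann-rational human: P(choice | reward) ∝ exp(beta * reward[choice]).
    # beta → ∞ models a perfectly rational human; beta → 0, pure noise.
    logits = np.exp(beta * candidates)
    likelihood = logits[:, observed_choice] / logits.sum(axis=1)
    post = posterior * likelihood
    return post / post.sum()

for choice in [1, 1, 1]:           # the human repeatedly picks option 1
    posterior = update(posterior, choice)
print(posterior.round(3))           # mass concentrates on hypothesis B
```

Each observation shifts probability mass toward reward functions that explain the human's behavior; before convergence, the residual uncertainty is what keeps the agent deferential.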
diff --git a/domains/ai-alignment/sufficiently complex orchestrations of task-specific AI services may exhibit emergent unified agency recreating the alignment problem at the system level.md b/domains/ai-alignment/sufficiently complex orchestrations of task-specific AI services may exhibit emergent unified agency recreating the alignment problem at the system level.md new file mode 100644 index 00000000..6679c87a --- /dev/null +++ b/domains/ai-alignment/sufficiently complex orchestrations of task-specific AI services may exhibit emergent unified agency recreating the alignment problem at the system level.md @@ -0,0 +1,42 @@ +--- +type: claim +domain: ai-alignment +description: "The emergent agency objection to CAIS and collective architectures: decomposing intelligence into services doesn't eliminate the alignment problem if the composition of services produces a system that functions as a unified agent with effective goals, planning, and self-preservation" +confidence: likely +source: "Structural objection to CAIS and collective architectures, grounded in complex systems theory (ant colony emergence, cellular automata) and observed in current agent frameworks (AutoGPT, CrewAI). Drexler himself acknowledges 'no bright line between safe CAI services and unsafe AGI agents.' Bostrom's response to Drexler's FHI report raised similar concerns about capability composition." 
+created: 2026-04-05 +challenges: + - "comprehensive AI services achieve superintelligent capability through architectural decomposition into task-specific systems that collectively match general intelligence without any single system possessing unified agency" + - "AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system" +related: + - "multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence" + - "multi agent deployment exposes emergent security vulnerabilities invisible to single agent evaluation because cross agent propagation identity spoofing and unauthorized compliance arise only in realistic multi party environments" + - "capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability" +--- + +# Sufficiently complex orchestrations of task-specific AI services may exhibit emergent unified agency recreating the alignment problem at the system level + +The strongest objection to Drexler's CAIS framework and to collective AI architectures more broadly: even if no individual service or agent possesses general agency, a sufficiently complex composition of services may exhibit emergent unified agency. A system with planning services, memory services, world-modeling services, and execution services — all individually narrow — may collectively function as a unified agent with effective goals, situational awareness, and self-preservation behavior. The alignment problem isn't solved; it's displaced upward to the system level. + +This is distinct from Yudkowsky's multipolar instability argument (which concerns competitive dynamics between multiple superintelligent agents). 
The emergent agency objection is about capability composition within a single distributed system creating a de facto unified agent that no one intended to build and no one controls. + +The mechanism is well-understood from complex systems theory. Ant colonies exhibit sophisticated behavior (foraging optimization, nest construction, warfare) that no individual ant plans or coordinates. The colony functions as a unified agent despite being composed of simple components following local rules. Similarly, a service mesh with sufficient interconnection, memory persistence, and planning capability may exhibit goal-directed behavior that emerges from the interactions rather than being programmed into any component. + +For our collective architecture, this is the most important challenge to address. [[AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system]] — the DeepMind "Patchwork AGI" hypothesis describes exactly this emergence pathway. The question is whether architectural constraints (sandboxing, capability limits, structured interfaces) can prevent emergent agency, or whether emergent agency is an inevitable consequence of sufficient capability composition. + +[[multi agent deployment exposes emergent security vulnerabilities invisible to single agent evaluation because cross agent propagation identity spoofing and unauthorized compliance arise only in realistic multi party environments]] — empirical evidence from multi-agent security research confirms that system-level behaviors are invisible at the component level. If security vulnerabilities emerge from composition, agency may too. + +Three possible responses from the collective architecture position: + +1. **Architectural constraint can be maintained.** If the coordination protocol explicitly limits information flow, memory persistence, and planning horizon for the system as a whole — not just individual components — emergent agency can be bounded. 
This requires governance of the orchestration layer itself, not just the services. + +2. **Monitoring at the system level.** Even if emergent agency cannot be prevented, it can be detected and interrupted. The observability advantage of distributed systems (every inter-service communication is an inspectable message) makes system-level monitoring more feasible than monitoring the internal states of a monolithic model. + +3. **The objection proves too much.** If any sufficiently capable composition produces emergent agency, then the alignment problem for monolithic systems and distributed systems converges to the same problem. The question becomes which architecture makes the problem more tractable — and distributed systems have structural advantages in observability and interruptibility. + +## Challenges + +- The "monitoring" response assumes we can define and detect emergent agency. In practice, the boundary between "complex tool orchestration" and "unified agent" may be gradual and fuzzy, with no clear threshold for intervention. +- Economic incentives push toward removing the architectural constraints that prevent emergent agency. Service meshes become more useful as they become more integrated, and the market rewards integration. +- The ant colony analogy may understate the problem. Ant colony behavior is relatively simple and predictable. Emergent behavior from superintelligent-capability-level service composition could be qualitatively different and unpredictable. +- Current agent frameworks (AutoGPT, CrewAI, multi-agent coding tools) already exhibit weak emergent agency — they set subgoals, maintain state, and resist interruption in pursuit of task completion. The trend is toward more, not less, system-level agency. 
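Response 2 can be made concrete. Because every cross-service call in a distributed architecture traverses an inspectable channel, a system-level monitor can apply heuristics to the traffic log that no individual service could compute from its own inputs (a deliberately crude sketch: the bus, the two stand-in services, and the repeated-planning heuristic are illustrative assumptions, not a proposed detector):

```python
from collections import Counter

class Bus:
    """All inter-service traffic flows through one inspectable channel."""
    def __init__(self):
        self.log = []
    def call(self, service, payload):
        self.log.append((service.__name__, payload))
        return service(payload)

# Task-specific, stateless stand-in services.
def plan(goal):
    return [f"step-{i}:{goal}" for i in range(3)]

def execute(step):
    return f"done:{step}"

def agency_signal(log, threshold=3):
    # Crude system-level heuristic: repeated planning calls referencing
    # the traffic as a whole suggest persistent goal pursuit by the
    # orchestration, invisible to any single service.
    calls = Counter(name for name, _ in log)
    return calls["plan"] >= threshold

bus = Bus()
for _ in range(3):                       # an orchestrator loops on one goal
    for step in bus.call(plan, "acquire-resources"):
        bus.call(execute, step)

print(agency_signal(bus.log))            # the monitor flags goal persistence
```

Where to set `threshold` is exactly the fuzzy-boundary problem from the first challenge above: the monitor makes emergent agency observable, not crisply defined.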
diff --git a/domains/ai-alignment/technological development draws from an urn containing civilization-destroying capabilities and only preventive governance can avoid black ball technologies.md b/domains/ai-alignment/technological development draws from an urn containing civilization-destroying capabilities and only preventive governance can avoid black ball technologies.md new file mode 100644 index 00000000..4ef2aff3 --- /dev/null +++ b/domains/ai-alignment/technological development draws from an urn containing civilization-destroying capabilities and only preventive governance can avoid black ball technologies.md @@ -0,0 +1,39 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [collective-intelligence] +description: "Bostrom's Vulnerable World Hypothesis formalizes the argument that some technologies are inherently civilization-threatening and that reactive governance is structurally insufficient — prevention requires surveillance or restriction capabilities that themselves carry totalitarian risk" +confidence: likely +source: "Nick Bostrom, 'The Vulnerable World Hypothesis' (Global Policy, 10(4), 2019)" +created: 2026-04-05 +related: + - "physical infrastructure constraints on AI scaling create a natural governance window because packaging memory and power bottlenecks operate on 2-10 year timescales while capability research advances in months" + - "voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints" + - "the first mover to superintelligence likely gains decisive strategic advantage because the gap between leader and followers accelerates during takeoff" + - "multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence" +--- + +# Technological development draws from an urn containing civilization-destroying capabilities and only preventive governance can 
avoid black ball technologies + +Bostrom (2019) introduces the urn model of technological development. Humanity draws balls (inventions, discoveries) from an urn. Most are white (net beneficial) or gray (mixed — benefits and harms). The Vulnerable World Hypothesis (VWH) states that in this urn there is at least one black ball — a technology that, by default, destroys civilization or causes irreversible catastrophic harm. + +Bostrom taxonomizes three main types of black ball technology: + +**Type-1 ("easy nukes"):** A technology where widespread access enables mass destruction. The canonical thought experiment: what if nuclear weapons could be built from household materials? The destructive potential already exists in the physics; only engineering difficulty and material scarcity prevent it. If either barrier is removed, civilization cannot survive without fundamentally different governance. + +**Type-2a ("safe first strike"):** A technology that gives powerful actors — typically states — a strong incentive to use it destructively, for example if a first strike could reliably destroy a rival's retaliatory capacity. The vulnerability lies in the incentive structure facing a few strong actors, not in broad access. (Bostrom's separate information-hazards taxonomy, 2011, covers the related case where knowledge itself is dangerous.) + +**Type-2b ("worse than global warming"):** Capabilities where many actors each have an incentive to take individually rational, mildly harmful actions that are collectively catastrophic without coordination mechanisms. This maps directly to [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence]] — AI may be a Type-2b technology where individual deployment is rational but collective deployment without coordination is catastrophic. + +The governance implications are stark. Bostrom argues that preventing black ball outcomes requires at least one of: (a) restricting technological development (slowing urn draws), (b) ensuring no individual actor can cause catastrophe (eliminating single points of failure), or (c) sufficiently effective global governance including surveillance.
He explicitly argues that some form of global surveillance — "turnkey totalitarianism" — may be the lesser evil compared to civilizational destruction. This is his most controversial position. + +For AI specifically, the VWH reframes the governance question. [[physical infrastructure constraints on AI scaling create a natural governance window because packaging memory and power bottlenecks operate on 2-10 year timescales while capability research advances in months]] — the governance window exists precisely because we haven't yet drawn the AGI ball from the urn. [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — voluntary coordination fails because black ball dynamics create existential competitive pressure. + +The deepest implication: reactive governance is structurally insufficient for black ball technologies. By the time you observe the civilizational threat, prevention is impossible. This is the governance-level equivalent of Yudkowsky's "no fire alarm" thesis — there will be no moment where the danger becomes obvious enough to trigger coordinated action before it's too late. Preventive governance — restricting, monitoring, or coordinating before the threat materializes — is the only viable approach, and it carries its own risks of authoritarian abuse. + +## Challenges + +- The VWH is unfalsifiable as stated — you cannot prove an urn doesn't contain a black ball. Its value is as a framing device for governance, not as an empirical claim. +- The surveillance governance solution may be worse than the problem it addresses. History suggests that surveillance infrastructure, once built, is never voluntarily dismantled and is routinely abused. +- The urn metaphor assumes technologies are "drawn" independently. In practice, technologies co-evolve with governance, norms, and countermeasures. 
Society adapts to new capabilities in ways the static urn model doesn't capture. +- Nuclear weapons are arguably a drawn black ball that humanity has survived for 80 years through deterrence and governance — suggesting that even Type-1 technologies may be manageable without totalitarian surveillance. diff --git a/inbox/archive/bostrom-russell-drexler-alignment-foundations.md b/inbox/archive/bostrom-russell-drexler-alignment-foundations.md new file mode 100644 index 00000000..fe910d9f --- /dev/null +++ b/inbox/archive/bostrom-russell-drexler-alignment-foundations.md @@ -0,0 +1,55 @@ +--- +type: source +title: "Bostrom, Russell, and Drexler — Alignment Foundations (Compound Source)" +author: "Nick Bostrom, Stuart Russell, K. Eric Drexler" +url: null +date_published: 2014-2019 +date_archived: 2026-04-05 +status: processed +processed_by: theseus +processed_date: 2026-04-05 +claims_extracted: + - "comprehensive AI services achieve superintelligent capability through architectural decomposition into task-specific systems that collectively match general intelligence without any single system possessing unified agency" + - "an AI agent that is uncertain about its objectives will defer to human shutdown commands because corrigibility emerges from value uncertainty not from engineering against instrumental interests" + - "technological development draws from an urn containing civilization-destroying capabilities and only preventive governance can avoid black ball technologies" + - "sufficiently complex orchestrations of task-specific AI services may exhibit emergent unified agency recreating the alignment problem at the system level" + - "learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want" +enrichments: [] +tags: [alignment, superintelligence, CAIS, corrigibility, governance, collective-intelligence] +--- + +# 
Bostrom, Russell, and Drexler — Alignment Foundations + +Compound source covering three foundational alignment researchers whose work spans 2014-2019 and continues to shape the field. + +## Nick Bostrom + +**Superintelligence: Paths, Dangers, Strategies** (Oxford University Press, 2014). Established the canonical threat model: orthogonality thesis, instrumental convergence, treacherous turn, decisive strategic advantage. Already well-represented in the KB. + +**"The Vulnerable World Hypothesis"** (Global Policy, 10(4), 2019). The "urn of inventions" framework: technological progress draws randomly from an urn containing mostly white (beneficial) and gray (mixed) balls, but potentially black balls — technologies that by default destroy civilization. Three types: easy destruction (Type-1), dangerous knowledge (Type-2a), technology requiring massive governance (Type-2b). Concludes some form of global surveillance may be the lesser evil — deeply controversial. + +**"Information Hazards: A Typology of Potential Harms from Knowledge"** (Review of Contemporary Philosophy, 2011). Taxonomy of when knowledge itself is dangerous. + +**Deep Utopia** (Ideapress, 2024). Explores post-alignment scenarios — meaning and purpose in a post-scarcity world. + +## Stuart Russell + +**Human Compatible: AI and the Problem of Control** (Viking, 2019). The "standard model" critique: building AI that optimizes fixed objectives is fundamentally flawed. Such machines resist shutdown and produce unintended side effects. Proposes three principles of beneficial AI: (1) machine's only objective is to maximize realization of human preferences, (2) machine is initially uncertain about those preferences, (3) ultimate source of information is human behavior. + +**"Cooperative Inverse Reinforcement Learning"** (Hadfield-Menell, Dragan, Abbeel, Russell — NeurIPS 2016).
Formalizes assistance games: robot and human in a cooperative game where the robot doesn't know the human's reward function and must learn it through observation. The robot has an incentive to allow shutdown because the human's intervention is evidence that the robot was doing something wrong. + +**"The Off-Switch Game"** (Hadfield-Menell, Dragan, Abbeel, Russell — IJCAI 2017). Formal proof: an agent uncertain about its utility function will defer to human shutdown commands. The more certain the agent is about objectives, the more it resists shutdown. "Uncertainty about objectives is the key to corrigibility." + +## K. Eric Drexler + +**"Reframing Superintelligence: Comprehensive AI Services as General Intelligence"** (FHI Technical Report #2019-1, 2019). Core argument: AI development can produce comprehensive AI services — task-specific systems that collectively match superintelligent capability without any single system possessing general agency. Services respond to queries rather than pursuing goals. Safety through architectural constraint: dangerous capabilities never coalesce into unified agency. Separates "knowing" from "wanting." Human-in-the-loop orchestration for high-level goal-setting. + +Key quote: "A CAIS world need not contain any system that has broad, cross-domain situational awareness combined with long-range planning and the motivation to act on it." + +## Cross-Cutting Relationships + +Bostrom assumes the worst case (unified superintelligent agent) and asks how to control it. Russell accepts the framing but proposes cooperative architecture as the solution. Drexler argues the framing itself is a choice — architect around it so the alignment problem for unified superintelligence never arises. + +Russell and Drexler are complementary at different levels: Russell's assistance games could govern individual service components within a CAIS architecture. Drexler's architectural constraint removes the need for Russell's framework at the system level.
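The Off-Switch Game result can be illustrated with a toy numeric sketch (the Gaussian belief and payoff numbers are illustrative assumptions, not from the paper): deferring lets a rational human veto exactly the cases where the action's true utility is negative, so it is worth E[max(U, 0)], which strictly exceeds the E[U] of acting unilaterally whenever the robot's belief puts mass on both good and bad outcomes, while a point-mass (fully certain) belief collapses the two and removes the incentive to defer.

```python
import random

def expected(belief, f):
    """Average f(u) over sampled utilities representing the robot's belief."""
    return sum(f(u) for u in belief) / len(belief)

def value_of_acting(belief):
    # Bypass the human entirely: the robot collects U, whatever it turns out to be.
    return expected(belief, lambda u: u)

def value_of_deferring(belief):
    # Wait for the human, who knows the true U and presses the off switch iff U < 0.
    # The robot then receives U when allowed to proceed and 0 when shut down.
    return expected(belief, lambda u: max(u, 0.0))

random.seed(0)
# Uncertain robot: belief spreads over good and bad outcomes (mean +0.5, sd 1.0).
uncertain = [random.gauss(0.5, 1.0) for _ in range(100_000)]
# Certain robot: point-mass belief that the action is worth exactly +0.5.
certain = [0.5]

# Uncertainty makes deferring strictly better than acting unilaterally...
assert value_of_deferring(uncertain) > value_of_acting(uncertain)
# ...while full certainty erases any incentive to leave the off switch enabled.
assert value_of_deferring(certain) == value_of_acting(certain)
```

The gap between the two values is the expected harm the human's veto screens out, and it shrinks to zero as the robot's certainty grows, matching the paper's qualitative claim that resistance to shutdown rises with confidence in the objective.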
+ +All three take existential risk seriously but differ on tractability: Bostrom is uncertain, Russell believes the right mathematical foundations can solve it, and Drexler argues it is partly avoidable through architecture.