teleo-codex/inbox/archive/2019-bostrom-vulnerable-world-hypothesis.md
m3taversal 1398aa193f theseus: archive 9 primary sources for alignment research program phases 1-3
- What: Source archives for key works by Yudkowsky (AGI Ruin, No Fire Alarm),
  Christiano (What Failure Looks Like, AI Safety via Debate, IDA, ELK),
  Russell (Human Compatible), Drexler (CAIS), and Bostrom (Vulnerable World Hypothesis)
- Why: m3ta directive to ingest primary source materials for alignment researchers.
  These 9 texts are the foundational works underlying claims extracted in PRs #2414,
  #2418, and #2419. Source archives ensure agents can reference primary texts without
  re-fetching and content persists if URLs go down.
- Connections: All 9 sources are marked as processed with claims_extracted linking
  to the specific KB claims they produced.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
2026-04-05 23:50:36 +01:00


type: source
title: The Vulnerable World Hypothesis
author: Nick Bostrom
url: https://onlinelibrary.wiley.com/doi/full/10.1111/1758-5899.12718
date: 2019-11-01
domain: ai-alignment
intake_tier: research-task
rationale: Governance-level framing for why coordination fails even when everyone wants to coordinate. The urn model contextualizes technology risk in a way that complements Yudkowsky's capability-level arguments and Christiano's economic-competition failure mode. Phase 3 alignment research program.
proposed_by: Theseus
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-05
claims_extracted: The vulnerable world hypothesis holds that technological development draws from an urn that may contain civilization-destroying capabilities; only preventive governance works, because reactive governance is structurally too late once a black-ball technology becomes accessible.
enrichments:
tags: alignment, governance, existential-risk, coordination, vulnerable-world, technology-risk, black-ball
notes: Published in Global Policy, Vol 10, Issue 4, pp 455-476. DOI: 10.1111/1758-5899.12718. Also available at nickbostrom.com/papers/vulnerable.pdf; an abridged version also exists.

The Vulnerable World Hypothesis

Published in Global Policy (2019) by Nick Bostrom. This paper introduces a framework for understanding how technological development can create existential risks even in the absence of malicious intent or misaligned AI.

The Urn Model

Bostrom models technological development as drawing balls from an urn:

  • White balls: Beneficial technologies (most historical inventions)
  • Gray balls: Technologies with mixed or manageable effects
  • Black balls: Technologies that, once discovered, destroy civilization by default

The hypothesis: there is some level of technological development at which civilization almost certainly gets devastated by default, unless extraordinary safeguards are in place. The question is not whether black balls exist, but whether we've been lucky so far in not drawing one.

Bostrom argues humanity has avoided black balls largely through luck, not wisdom. Nuclear weapons came close — but the minimum viable nuclear device requires nation-state resources. If nuclear reactions could be triggered by "sending an electric current through metal between glass sheets," civilization would not have survived the 20th century.
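The compounding logic of the urn can be made concrete with a toy Monte Carlo sketch. All numbers here are hypothetical, chosen for illustration only; the paper itself assigns no probabilities.

```python
import random

def survives(n_draws: int, p_black: float, rng: random.Random) -> bool:
    """One run of the urn: survival means no black ball in n_draws."""
    return all(rng.random() >= p_black for _ in range(n_draws))

def survival_rate(n_draws: int, p_black: float,
                  trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(no black ball after n_draws)."""
    rng = random.Random(seed)
    return sum(survives(n_draws, p_black, rng) for _ in range(trials)) / trials

# Even a 1-in-1000 chance per technology compounds: after 1000 draws the
# analytic survival probability is 0.999**1000, roughly 0.37.
print(survival_rate(1000, 0.001))
```

The point of the sketch is the geometric decay: survival probability falls exponentially in the number of draws, so "we haven't drawn a black ball yet" is weak evidence that the urn contains none.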

Vulnerability Types

Type-0: Surprising Strangelets

Hidden physical risks from experiments. Example: the (dismissed) concern during Trinity testing that a nuclear detonation might ignite Earth's atmosphere. The characteristic feature: we don't know about the risk until we've already triggered it.

Type-1: Easy Nukes

Technologies that enable small groups or individuals to inflict mass destruction; the "easy nukes" scenario is the paradigm case. If destructive capability becomes cheap and accessible, no governance structure can prevent all misuse among billions of potential actors.

Type-2a: Safe First Strike

Technologies that incentivize powerful actors toward preemptive use because striking first offers decisive advantage. Nuclear first-strike dynamics, but extended to any domain where the attacker has a structural advantage.

Type-2b: Worse Global Warming

Technologies where individual actors face incentives to take small harmful actions that accumulate to civilizational-scale damage. No single actor causes catastrophe, but the aggregate does. Climate change is the existing example; AI-driven economic competition could be another.

The Semi-Anarchic Default Condition

The vulnerable world hypothesis assumes the current global order has:

  1. Limited preventive policing: States can punish after the fact but struggle to prevent determined actors
  2. Limited global governance: No effective mechanism to coordinate all nation-states on technological restrictions
  3. Diverse actor motivations: Among billions of humans, some fraction will intentionally misuse any sufficiently accessible destructive technology

Under this condition, Type-1 vulnerabilities are essentially unsurvivable: if the technology exists and is accessible, someone will use it destructively.
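The "essentially unsurvivable" claim follows from elementary probability once access is broad. A minimal sketch under the illustrative assumption of independent actors; the actor count and per-actor misuse rate below are hypothetical, not figures from the paper.

```python
def p_no_misuse(n_actors: int, p_per_actor_year: float, years: int) -> float:
    """P(no destructive use) assuming independent actors, each with a
    constant per-actor annual probability of misusing the technology."""
    return (1.0 - p_per_actor_year) ** (n_actors * years)

# Even a one-in-a-billion annual misuse rate per actor collapses once
# 100 million actors have access: surviving a century is ~exp(-10).
print(p_no_misuse(100_000_000, 1e-9, 100))
```

The lesson is that survival probability is driven by the product of actor count, time, and per-actor rate; under the semi-anarchic default condition, governance can only act on the first factor (restricting access), which is why accessibility is the decisive variable for Type-1 vulnerabilities.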

Governance Implications

Bostrom identifies four possible responses:

  1. Restrict technological development: Slow down or halt research in dangerous areas. Problem: competitive dynamics make this unstable (the state that restricts loses to the state that doesn't).

  2. Ensure adequate global governance: Build institutions capable of monitoring and preventing misuse. Problem: requires unprecedented international cooperation.

  3. Effective preventive policing: Mass surveillance sufficient to detect and prevent all destructive uses. Problem: dystopian implications, concentration of power.

  4. Differential technological development: Prioritize defensive technologies and safety-enhancing capabilities before offensive capabilities mature. Bostrom treats this as valuable but insufficient on its own against Type-1 vulnerabilities (once a black ball is drawn, it stays accessible), which is why the paper's analysis ultimately points toward preventive policing and global governance as the general stabilization options.

AI as Potential Black Ball

Bostrom doesn't focus specifically on AI in this paper, but the framework applies directly:

  • Superintelligent AI could be a Type-1 vulnerability (anyone who builds it can destroy civilization)
  • AI-driven economic competition is a Type-2b vulnerability (individual rational actors accumulating aggregate catastrophe)
  • AI development could discover other black ball technologies (accelerating the urn-drawing process)

Significance for Teleo KB

The Vulnerable World Hypothesis provides the governance-level framing that complements:

  • Yudkowsky's capability-level arguments (why alignment is technically hard)
  • Christiano's economic-competition failure mode (why misaligned AI gets deployed)
  • Alexander's Moloch (why coordination fails even among well-intentioned actors)

The key insight for our thesis: the semi-anarchic default condition is precisely what collective superintelligence architectures could address — providing the coordination mechanism that prevents careless draws from the urn.