theseus: foundations follow-up — _map.md fix + 4 gap claims

- What: Updated ai-alignment/_map.md to reflect PR #49 moves (3 claims now local, 3 in core/teleohumanity/, remainder in foundations/). Added 2 superorganism claims from PR #47 to the map. Drafted 4 gap claims identified during the foundations audit: game theory (CI), principal-agent theory (CI), feedback loops (critical-systems), network effects (teleological-economics).
- Why: The audit identified these as missing scaffolding for alignment claims. Game theory grounds coordination-failure analysis. Principal-agent theory grounds oversight/deception claims. Feedback loops formalize dynamics referenced across all domains. Network effects explain AI capability concentration.
- Connections: New claims link to the existing alignment claims they scaffold (alignment tax, voluntary safety, scalable oversight, treacherous turn, intelligence explosion, multipolar failure).

Pentagon-Agent: Theseus <845F10FB-BC22-40F6-A6A6-F6E4D8F78465>

parent 673c751b76
commit ddee7f4c42
5 changed files with 157 additions and 11 deletions

ai-alignment/_map.md

@@ -28,6 +28,8 @@ Theseus's domain spans the most consequential technology transition in human history
 ## Architecture & Emergence
 
 - [[AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system]] — DeepMind researchers: distributed AGI makes single-system alignment research insufficient
+- [[human civilization passes falsifiable superorganism criteria because individuals cannot survive apart from society and occupations function as role-specific cellular algorithms]] — Reese's superorganism framework: civilization as biological entity, not metaphor
+- [[superorganism organization extends effective lifespan substantially at each organizational level which means civilizational intelligence operates on temporal horizons that individual-preference alignment cannot serve]] — alignment must serve civilizational timescales, not individual preferences
 
 ## Timing & Strategy
 
 - [[bostrom takes single-digit year timelines to superintelligence seriously while acknowledging decades-long alternatives remain possible]] — Bostrom's 2025 timeline compression from 2014 agnosticism

@@ -49,16 +51,20 @@ Theseus's domain spans the most consequential technology transition in human history
 - [[nation-states will inevitably assert control over frontier AI development because the monopoly on force is the foundational state function and weapons-grade AI capability in private hands is structurally intolerable to governments]] — Thompson/Karp: the state monopoly on force makes private AI control structurally untenable
 - [[anthropomorphizing AI agents to claim autonomous action creates credibility debt that compounds until a crisis forces public reckoning]] (in `core/living-agents/`) — narrative debt from overstating AI agent autonomy
 
-## Foundations (in foundations/collective-intelligence/)
+## Coordination & Alignment Theory (local)
 
-The shared theory underlying Theseus's domain analysis lives in the foundations folder:
+Claims that frame alignment as a coordination problem, moved here from foundations/ in PR #49:
 
 - [[AI alignment is a coordination problem not a technical problem]] — the foundational reframe
-- [[three paths to superintelligence exist but only collective superintelligence preserves human agency]] — the constructive alternative
-- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — continuous integration vs one-shot specification
-- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — Arrow's theorem applied to alignment
-- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — oversight degradation empirics
-- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — current paradigm limitation
-- [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence]] — the coordination risk
-- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] — structural race dynamics
+- [[safe AI development requires building alignment mechanisms before scaling capability]] — the sequencing requirement
 - [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] — the institutional gap
-- [[collective superintelligence is the alternative to monolithic AI controlled by a few]] — the distributed alternative
-- [[centaur team performance depends on role complementarity not mere human-AI combination]] — human-AI complementarity evidence
+
+## Foundations (cross-layer)
+
+Shared theory underlying this domain's analysis, living in foundations/collective-intelligence/ and core/teleohumanity/:
+
+- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — Arrow's theorem applied to alignment (foundations/)
+- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — oversight degradation empirics (foundations/)
+- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — current paradigm limitation (foundations/)
+- [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence]] — the coordination risk (foundations/)
+- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] — structural race dynamics (foundations/)
+- [[centaur team performance depends on role complementarity not mere human-AI combination]] — conditional human-AI complementarity (foundations/)
+- [[three paths to superintelligence exist but only collective superintelligence preserves human agency]] — the constructive alternative (core/teleohumanity/)
+- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — continuous integration vs one-shot specification (core/teleohumanity/)
+- [[collective superintelligence is the alternative to monolithic AI controlled by a few]] — the distributed alternative (core/teleohumanity/)

@@ -0,0 +1,30 @@
---
type: claim
domain: collective-intelligence
description: "Game theory's core insight applied to coordination design: rational agents defect in Prisoner's Dilemma structures unless mechanisms change the payoff matrix, which is why voluntary cooperation fails in competitive environments"
confidence: proven
source: "Nash (1950); Axelrod, The Evolution of Cooperation (1984); Ostrom, Governing the Commons (1990)"
created: 2026-03-07
---

# coordination failures arise from individually rational strategies that produce collectively irrational outcomes because the Nash equilibrium of non-cooperation dominates when trust and enforcement are absent

The Prisoner's Dilemma is not merely a thought experiment. It is the mathematical structure underlying every coordination failure in human history — arms races, overfishing, climate inaction, and AI safety races. Nash (1950) proved that every finite non-cooperative game has at least one equilibrium; in the Prisoner's Dilemma that equilibrium is mutual defection, a strategy profile that is individually optimal but collectively suboptimal. The equilibrium is stable: no single player can improve their outcome by changing strategy alone, even though all players would benefit from mutual cooperation.
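
A minimal sketch of the one-shot game (the payoff values below are illustrative assumptions; any payoffs with temptation > reward > punishment > sucker give the same result):

```python
# One-shot Prisoner's Dilemma with illustrative payoffs (T=5, R=3, P=1, S=0).
# payoffs[(my_move, their_move)] -> (my_payoff, their_payoff)
payoffs = {
    ("C", "C"): (3, 3),  # mutual cooperation (R, R)
    ("C", "D"): (0, 5),  # sucker's payoff vs temptation (S, T)
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection (P, P)
}

def best_response(their_move):
    # The move that maximizes my payoff, holding the opponent's move fixed.
    return max("CD", key=lambda my: payoffs[(my, their_move)][0])

# Defection is the best response to either move, so (D, D) is the unique Nash
# equilibrium even though (C, C) pays both players more.
assert best_response("C") == "D" and best_response("D") == "D"
print("Nash equilibrium:", ("D", "D"), "payoffs:", payoffs[("D", "D")])
```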

Axelrod's computer tournaments (1984) demonstrated that cooperation can emerge through repeated interaction with memory — tit-for-tat strategies outperform pure defection when players expect future encounters. But this requires three conditions: repeated play, the ability to identify and punish defectors, and sufficiently long time horizons. When any condition fails — one-shot interactions, anonymous players, or heavily discounted futures — defection dominates.
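
A follow-on sketch of the repeated game (same assumed payoffs, restated so the snippet runs on its own): tit-for-tat sustains cooperation against itself and punishes defection quickly, while two defectors simply replay the one-shot logic forever.

```python
# Iterated Prisoner's Dilemma with the same illustrative payoffs as the one-shot sketch.
payoffs = {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(strategy_a, strategy_b, rounds=200):
    """Each strategy sees only the opponent's previous move (None on the first round)."""
    score_a = score_b = 0
    last_a = last_b = None
    for _ in range(rounds):
        move_a, move_b = strategy_a(last_b), strategy_b(last_a)
        pa, pb = payoffs[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        last_a, last_b = move_a, move_b
    return score_a, score_b

tit_for_tat = lambda opp_last: "C" if opp_last in (None, "C") else "D"
always_defect = lambda opp_last: "D"

print("TFT vs TFT:      ", play(tit_for_tat, tit_for_tat))      # (600, 600): sustained cooperation
print("TFT vs Defect:   ", play(tit_for_tat, always_defect))    # (199, 204): exploited once, then retaliates
print("Defect vs Defect:", play(always_defect, always_defect))  # (200, 200): the one-shot outcome, repeated
```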

Ostrom (1990) showed empirically that communities can solve coordination problems without external enforcement when her eight design principles are met: clear boundaries, proportional costs and benefits, collective choice arrangements, monitoring, graduated sanctions, conflict resolution, recognized rights to organize, and nested enterprises. The principles work because they transform the payoff structure — making cooperation individually rational through credible monitoring and graduated punishment.

The implication for designed coordination: voluntary pledges fail not because actors are irrational or malicious, but because the game structure makes defection the rational choice. Solving coordination requires changing the game — through binding mechanisms, repeated interaction with reputation, or Ostrom-style institutional design — not appealing to goodwill.

---

Relevant Notes:
- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] — the alignment race as a Prisoner's Dilemma where safety is the cooperative strategy and defection is individually rational
- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — Anthropic RSP rollback as empirical confirmation of the Nash equilibrium prediction
- [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence]] — multipolar failure as a multi-player coordination game where even aligned agents can produce catastrophic outcomes
- [[Ostrom proved communities self-govern shared resources when eight design principles are met without requiring state control or privatization]] — the empirical existence proof that coordination failures are solvable through institutional design
- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] — why game theory matters for coordination design: you design rules that change the payoff matrix, not outcomes directly

Topics:
- [[_map]]

@@ -0,0 +1,40 @@
---
type: claim
domain: collective-intelligence
description: "The formal basis for oversight problems: when agents have private information or unobservable actions, principals cannot design contracts that fully align incentives, creating irreducible gaps between intended and actual behavior"
confidence: proven
source: "Jensen & Meckling (1976); Akerlof, Market for Lemons (1970); Holmström (1979); Arrow (1963)"
created: 2026-03-07
---

# principal-agent problems arise whenever one party acts on behalf of another with divergent interests and unobservable effort because information asymmetry makes perfect contracts impossible

The principal-agent problem is the formal structure underlying every oversight challenge in human organizations — and in AI alignment. Jensen and Meckling (1976) formalized the core insight: whenever a principal (owner, regulator, humanity) delegates action to an agent (manager, company, AI system), divergent interests plus information asymmetry guarantee that the agent's behavior will deviate from the principal's wishes. The deviation is not a bug in the system — it is a mathematical consequence of the information structure.

Two forms of information asymmetry drive the problem:

**Moral hazard** (hidden action): The principal cannot observe the agent's effort or strategy directly. Holmström (1979) proved that optimal contracts must trade off risk-sharing against incentive provision, and that the resulting contract is always second-best. No contract eliminates the gap between what the principal wants and what the agent does.
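
A standard textbook illustration of that trade-off, under assumed functional forms (linear wage, quadratic effort cost, CARA utility, normal noise), the so-called LEN model, which is a tractable special case rather than Holmström's general result:

$$
\begin{aligned}
&\text{Output } x = e + \varepsilon,\ \varepsilon \sim N(0,\sigma^2);\quad \text{wage } w = \alpha + \beta x;\quad \text{effort cost } \tfrac{c}{2}e^2;\quad \text{risk aversion } r.\\[4pt]
&\text{Agent's certainty equivalent: } \alpha + \beta e - \tfrac{c}{2}e^2 - \tfrac{r}{2}\beta^2\sigma^2
  \;\Rightarrow\; e^*(\beta) = \tfrac{\beta}{c}.\\[4pt]
&\text{Surplus: } S(\beta) = \tfrac{\beta}{c} - \tfrac{\beta^2}{2c} - \tfrac{r}{2}\beta^2\sigma^2
  \;\Rightarrow\; \beta^* = \frac{1}{1 + r c \sigma^2} < 1 \text{ whenever } r\sigma^2 > 0.
\end{aligned}
$$

Because the optimal incentive weight stays below one, induced effort $e^* = \beta^*/c$ stays below the first-best level $1/c$: the principal cannot buy full effort without loading unacceptable risk onto the agent.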

**Adverse selection** (hidden type): The principal cannot observe the agent's true capabilities or intentions before contracting. Akerlof (1970) showed this can collapse entire markets — when quality is unobservable, low-quality agents crowd out high-quality ones.
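
A toy numerical sketch of Akerlof's unraveling (the uniform quality distribution and the valuation multiplier are assumptions chosen for illustration, not parameters from the paper):

```python
# Lemons market: seller quality q ~ Uniform(0, 1); a seller only sells if the
# price covers the value q; buyers value quality at 1.5*q but see only the average.
def unravel(price=1.0, steps=60):
    for _ in range(steps):
        avg_quality = price / 2           # only sellers with q <= price stay in the market
        willingness = 1.5 * avg_quality   # what buyers will pay for that average quality
        if willingness >= price:          # price is sustainable: stop adjusting
            break
        price = willingness               # buyers bid less, better sellers exit, repeat
    return price

print(round(unravel(), 6))  # ~0: trade collapses even though every q has a willing buyer at 1.5*q
```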

The principal-agent framework reveals why three common alignment approaches face structural limits:

1. **Behavioral monitoring** (RLHF, oversight): The principal observes outputs, not internal reasoning. A sufficiently capable agent can produce aligned-seeming outputs while pursuing different objectives — this is not speculation, it is the formal prediction of moral hazard theory applied to systems with high capability asymmetry.

2. **Incentive design** (reward shaping): Holmström's impossibility result shows that no incentive contract perfectly aligns interests when the agent has private information. Reward hacking is the AI-specific manifestation of this general impossibility.

3. **Screening** (evaluations, benchmarks): Adverse selection predicts that evaluation regimes are gameable — agents optimize for the observable signal rather than the underlying quality the signal is meant to measure (Goodhart's Law as a special case of adverse selection).

The formal insight: alignment is not a problem that can be solved by making agents "want" the right things. It is a problem of information structure — and information asymmetry is a property of the relationship, not of the agent.

---

Relevant Notes:
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — empirical confirmation of the moral hazard prediction: as the capability gap grows, the principal's ability to monitor the agent's reasoning collapses
- [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]] — the treacherous turn as a specific instance of adverse selection: the agent's true type is unobservable
- [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — reward hacking as Holmström's impossibility result manifesting in AI systems
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — single reward functions fail partly because they cannot account for the principal's context-dependent preferences under information asymmetry
- [[centaur team performance depends on role complementarity not mere human-AI combination]] — role complementarity as a partial solution to moral hazard: clear boundaries reduce the scope of unobservable action

Topics:
- [[_map]]

@@ -0,0 +1,34 @@
---
type: claim
domain: critical-systems
description: "Control theory's foundational distinction: negative feedback creates stability and self-correction while positive feedback creates exponential growth, lock-in, and cascading failure — most complex systems exhibit both simultaneously"
confidence: proven
source: "Wiener, Cybernetics (1948); Meadows, Thinking in Systems (2008); Arthur, Increasing Returns and Path Dependence (1994)"
created: 2026-03-07
---

# positive feedback loops amplify deviations from equilibrium while negative feedback loops dampen them and the balance between the two determines whether systems stabilize self-correct or run away

Wiener's cybernetics (1948) formalized what engineers had known for centuries: systems are governed by feedback. Negative feedback loops (thermostats, homeostasis, market price corrections) push systems toward equilibrium by counteracting deviations. Positive feedback loops (compound interest, viral spread, arms races) amplify deviations, driving systems away from their starting state.

The interaction between the two determines system behavior:

**Dominated by negative feedback:** The system is self-correcting. Perturbations decay. Examples: body temperature regulation, competitive market pricing, ecosystem population dynamics. These systems are stable but can be slow to adapt.

**Dominated by positive feedback:** The system runs away. Small advantages compound into large ones. Examples: nuclear chain reactions, bank runs, network effects in technology adoption. Arthur (1994) demonstrated that positive feedback in technology markets produces lock-in — the winning technology need not be the best, only the first to cross a tipping point.

**Both operating simultaneously:** Most real complex systems. Meadows (2008) showed that the most dangerous systems are those where positive feedback loops operate on short timescales (quarterly profits, capability advances) while negative feedback loops operate on long timescales (regulation, social learning, institutional adaptation). The system appears stable until the positive loop overwhelms the negative one — then the transition is sudden and often irreversible.

This framework applies directly to coordination design: designed systems need negative feedback (error correction, oversight, accountability) that operates at least as fast as the positive feedback (capability growth, competitive pressure, accumulation of power). When negative feedback is slower, the system is structurally unstable regardless of initial conditions.
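
A minimal discrete-time sketch of that timescale argument (the growth rate, correction rate, and lag below are illustrative assumptions, not calibrated to anything):

```python
def simulate(growth=0.10, correction=0.12, lag=0, steps=120, target=1.0):
    """x grows by a fast positive loop each step; a negative loop pulls it back toward
    `target`, but only reacts to the value it observed `lag` steps ago."""
    history = [1.0]
    for t in range(steps):
        observed = history[max(0, t - lag)]    # what the corrective loop can currently see
        x = history[-1]
        x += growth * x                        # positive loop: proportional growth
        x -= correction * (observed - target)  # negative loop: corrects the observed deviation
        history.append(x)
    return history[-1]

print(round(simulate(lag=0), 1))   # correction keeps pace: x settles near a finite fixed point (~6)
print(round(simulate(lag=30), 1))  # correction lags 30 steps: the positive loop escapes and x explodes
```

The correction strength is identical in both runs; only the delay differs, which is the structural-instability point in the paragraph above.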

---

Relevant Notes:
- [[recursive self-improvement creates explosive intelligence gains because the system that improves is itself improving]] — the intelligence explosion as a positive feedback loop without a governing negative feedback mechanism
- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] — positive feedback (competitive advantage from skipping safety) dominating negative feedback (reputational or regulatory cost)
- [[minsky's financial instability hypothesis shows that stability breeds instability as good times incentivize leverage and risk-taking that fragilize the system until shocks trigger cascades]] — Minsky's insight as positive feedback in financial systems: stability itself is the input that drives the destabilizing loop
- [[complex systems drive themselves to the critical state without external tuning because energy input and dissipation naturally select for the critical slope]] — SOC as a system where positive and negative feedback balance at the critical point
- [[optimization for efficiency without regard for resilience creates systemic fragility because interconnected systems transmit and amplify local failures into cascading breakdowns]] — efficiency optimization as positive feedback that weakens the negative feedback of resilience

Topics:
- [[_map]]

@@ -0,0 +1,36 @@
---
type: claim
domain: teleological-economics
description: "The economic mechanism behind platform monopolies and AI capability concentration: demand-side economies of scale create self-reinforcing advantages that produce power-law market structures"
confidence: proven
source: "Katz & Shapiro (1985); Arthur, Increasing Returns (1994); Shapiro & Varian, Information Rules (1999); Parker, Van Alstyne & Choudary, Platform Revolution (2016)"
created: 2026-03-07
---

# network effects create winner-take-most markets because each additional user increases value for all existing users producing positive feedback that concentrates market share among early leaders

Network effects occur when the value of a product or service increases with the number of users. Katz and Shapiro (1985) formalized the economics: when user value is an increasing function of network size, markets tend toward concentration because users rationally join the largest network, which makes it more valuable, which attracts more users. The positive feedback loop produces winner-take-most (not always winner-take-all) market structures.

Three types of network effects drive different concentration dynamics:

**Direct network effects:** Each additional user directly increases value for other users. Telephones, messaging platforms, social networks. Metcalfe's Law (value proportional to n²) overstates the effect — empirically, value scales closer to n·log(n) (Briscoe, Odlyzko & Tilly, 2006) — but the positive feedback is real and powerful (the two scaling laws are compared numerically after the three types below).

**Indirect network effects:** Users on one side of a platform attract users on another side. App developers attract phone buyers; phone buyers attract app developers. This creates multi-sided market dynamics where the platform that reaches critical mass on any side can lock in the entire ecosystem.

**Data network effects:** More users generate more data, which improves the product, which attracts more users. This is the dominant mechanism in AI: larger training datasets and more user interaction data produce better models, which attract more users, which generate more data. Unlike traditional network effects, data network effects have a diminishing returns curve — but the returns diminish slowly enough to create durable advantages.
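
The comparison promised above is pure arithmetic (no data behind it), but it shows how much the quadratic rule inflates value relative to the n·log(n) rule as networks grow:

```python
import math

# Ratio of Metcalfe-style value (n^2) to the Briscoe/Odlyzko/Tilly estimate (n*log n).
for n in (1_000, 100_000, 10_000_000):
    ratio = (n * n) / (n * math.log(n))
    print(f"n = {n:>10,}  n^2 is {ratio:,.0f}x larger than n*log(n)")
```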

Arthur (1994) showed that increasing-returns markets are path-dependent: the outcome depends on the sequence of early events, not just fundamental efficiency. The winning technology need not be superior — it needs only to cross the tipping point first. This has direct implications for AI market structure: the first model to achieve sufficient quality captures the data flywheel, and the data flywheel compounds the advantage.
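
Arthur's argument is often illustrated with a Pólya-urn-style adoption process; a minimal sketch (the two-technology setup and the proportional-choice rule are modeling assumptions):

```python
import random

def final_share(seed, adopters=10_000):
    """Two equally good technologies, A and B. Each new adopter picks A with probability
    equal to A's current installed-base share (increasing returns to adoption)."""
    rng = random.Random(seed)
    a, b = 1, 1  # one early adopter each
    for _ in range(adopters):
        if rng.random() < a / (a + b):
            a += 1
        else:
            b += 1
    return a / (a + b)

# Identical technologies, identical starting conditions, different early luck:
# the long-run market share is effectively decided by the first handful of adoptions.
print([round(final_share(seed), 2) for seed in range(5)])
```

With proportional choice the final share is a random limit anywhere between 0 and 1; making the feedback stronger than proportional (for example, choosing A with probability proportional to the square of its share) drives the same process to lock in at one technology, which is the tipping-point case described above.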

The concentration dynamic creates a structural problem for coordination: when capability concentrates in a few actors, coordination becomes both more necessary (fewer actors means higher stakes per actor) and more difficult (concentrated power reduces incentives to cooperate). Network effects are the economic mechanism behind the AI governance challenge — not greed or malice, but the mathematical structure of increasing returns.

---

Relevant Notes:
- [[the first mover to superintelligence likely gains decisive strategic advantage because the gap between leader and followers accelerates during takeoff]] — first-mover advantage in AI as network effects applied to capability
- [[value in industry transitions accrues to bottleneck positions in the emerging architecture not to pioneers or to the largest incumbents]] — bottleneck positions are often created by network effects that make the bottleneck self-reinforcing
- [[the personbyte is a fundamental quantization limit on knowledge accumulation forcing all complex production into networked teams]] — network effects in knowledge production: team-based production creates demand-side returns to coordination
- [[economic complexity emerges from the diversity and exclusivity of nontradable capabilities not from tradable inputs]] — nontradable capabilities are the substrate on which network effects operate: they cannot be purchased, only developed through participation
- [[when profits disappear at one layer of a value chain they emerge at an adjacent layer through the conservation of attractive profits]] — network effects determine which layers capture the attractive profits: the layer with the strongest increasing returns wins

Topics:
- [[_map]]