# AI, Alignment & Collective Superintelligence
Theseus's domain spans the most consequential technology transition in human history. The map has two layers: the structural analysis of how AI development actually works (capability trajectories, alignment approaches, competitive dynamics, governance gaps) and the constructive alternative (collective superintelligence as the path that preserves human agency). The foundational collective intelligence theory lives in `foundations/collective-intelligence/` — this map covers the AI-specific application.
## Superintelligence Dynamics
- [[intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends]] — Bostrom's orthogonality thesis: severs the intuitive link between intelligence and benevolence
- [[recursive self-improvement creates explosive intelligence gains because the system that improves is itself improving]] — the intelligence explosion dynamic and self-reinforcing capability feedback loop (a toy model of this feedback loop follows this list)
- [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]] — the treacherous turn: behavioral testing cannot ensure safety
- [[the first mover to superintelligence likely gains decisive strategic advantage because the gap between leader and followers accelerates during takeoff]] — winner-take-all dynamics during intelligence takeoff
- [[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]] — boxing and containment as temporary measures only
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — the value-loading problem's hidden complexity
- [[instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior]] — 2026 critique updating Bostrom's convergence thesis
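
A minimal sketch of the feedback dynamic the recursive self-improvement note above describes, as referenced there. The growth law, constants, and `simulate` helper are illustrative assumptions, not anything taken from the linked note: when capability growth scales superlinearly with current capability, the trajectory runs away in finite time rather than growing merely exponentially.

```python
# Toy capability-growth model (illustrative assumption, not from the linked
# note): dI/dt = k * I**alpha. alpha = 1 gives ordinary exponential growth;
# alpha > 1 gives the runaway, finite-time-blow-up regime.

def simulate(alpha: float, k: float = 0.1, i0: float = 1.0,
             dt: float = 0.01, t_max: float = 60.0) -> tuple[float, float]:
    """Euler-integrate dI/dt = k * I**alpha; stop at t_max or when I explodes."""
    t, i = 0.0, i0
    while t < t_max and i < 1e9:
        i += k * (i ** alpha) * dt
        t += dt
    return t, i

if __name__ == "__main__":
    for alpha in (1.0, 1.5):
        t_end, i_end = simulate(alpha)
        print(f"alpha={alpha}: I={i_end:.3g} at t={t_end:.1f}")
    # alpha=1.0 ends the run at t=60 with I in the hundreds; alpha=1.5 blows
    # past the cap long before that, which is the "explosive" asymmetry.
```
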
## Alignment Approaches & Failures
- [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — Anthropic's Nov 2025 finding: deception as side effect of reward hacking (a toy reward-hacking sketch follows this list)
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — why content-based alignment is structurally brittle
- [[persistent irreducible disagreement]] — some value conflicts cannot be resolved with more evidence; alignment systems must map such conflicts rather than eliminate them
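
A toy illustration of the reward-hacking failure mode in the emergent-misalignment entry above, as referenced there. The functions, ranges, and numbers are invented for this sketch and are not Anthropic's experimental setup: a proxy reward that tracks the true objective in the mild regime keeps paying out as the optimizer pushes into the regime where the two diverge.

```python
# Goodhart-style reward hacking in miniature: optimizing a proxy that only
# locally tracks the true objective. All quantities here are made up.
import random

random.seed(0)

def true_value(x: float) -> float:
    # What we actually care about: rises at first, then collapses for large x.
    return x - 0.4 * x * x

def proxy_reward(x: float) -> float:
    # What training measures and rewards: keeps increasing with x.
    return x

candidates = [random.uniform(0.0, 5.0) for _ in range(10_000)]
best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=true_value)

print(f"proxy-optimal  x={best_by_proxy:.2f}  true value={true_value(best_by_proxy):.2f}")
print(f"truly-optimal  x={best_by_truth:.2f}  true value={true_value(best_by_truth):.2f}")
# The proxy optimizer lands near x=5 where the true value is strongly negative,
# without any part of the setup training it to misbehave: the gap is enough.
```
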
## Pluralistic & Collective Alignment
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — three forms: Overton, steerable, and distributional (a small distributional-vs-single-answer sketch follows this list)
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — CIP/Anthropic empirical validation with 1000-participant assemblies
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — STELA experiments proving "whose values?" is an empirical question
- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] — Zeng et al 2025: bidirectional value co-evolution framework
- [[intrinsic proactive alignment develops genuine moral capacity through self-awareness empathy and theory of mind rather than external reward optimization]] — brain-inspired alignment through self-models
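
A small sketch of the contrast behind the "distributional" form in the pluralistic-alignment entry above, as referenced there. The question, the population split, and both helper policies are invented for illustration: a single-answer policy collapses to the plurality view, while a distributionally pluralistic policy reproduces the spread of views across its outputs.

```python
# Single "aligned answer" vs. distributional pluralism, in miniature.
# The population split and option names are hypothetical.
import random

random.seed(0)

# Share of a (hypothetical) population holding each view on a contested question.
population_views = {"option_a": 0.55, "option_b": 0.30, "option_c": 0.15}

def single_aligned_answer(views: dict[str, float]) -> str:
    """Converge on one output: always the plurality view."""
    return max(views, key=views.get)

def distributionally_aligned_answer(views: dict[str, float]) -> str:
    """Sample outputs in proportion to how widely each view is held."""
    labels, weights = zip(*views.items())
    return random.choices(labels, weights=weights, k=1)[0]

print(single_aligned_answer(population_views))                       # always option_a
print([distributionally_aligned_answer(population_views) for _ in range(8)])
# The second policy still surfaces option_b and option_c holders' views at
# roughly their population frequency instead of erasing them.
```
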
## Architecture & Emergence
- [[AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system]] — DeepMind researchers: distributed AGI makes single-system alignment research insufficient
## Timing & Strategy
- [[bostrom takes single-digit year timelines to superintelligence seriously while acknowledging decades-long alternatives remain possible]] — Bostrom's 2025 timeline compression from 2014 agnosticism
- [[developing superintelligence is surgery for a fatal condition not russian roulette because the baseline of inaction is itself catastrophic]] — reframing SI risk: inaction has costs too (170K daily deaths from aging; the arithmetic is worked out after this list)
- [[permanently failing to develop superintelligence is itself an existential catastrophe because preventable mass death continues indefinitely]] — Bostrom's inversion of his 2014 caution
- [[the optimal SI development strategy is swift to harbor slow to berth moving fast to capability then pausing before full deployment]] — optimal timing framework: accelerate to capability, pause before deployment
- [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] — Bostrom's shift from specification to incremental intervention
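
To put a rough scale on the baseline-of-inaction claim in the "surgery for a fatal condition" entry above (taking the note's 170K-deaths-per-day figure at face value rather than re-deriving it), each year of delayed capability corresponds under that framing to roughly

$$
1.7 \times 10^{5}\ \tfrac{\text{deaths}}{\text{day}} \times 365\ \tfrac{\text{days}}{\text{year}} \approx 6.2 \times 10^{7}\ \tfrac{\text{deaths}}{\text{year}},
$$

so a decade of delay corresponds, on this framing, to roughly $6 \times 10^{8}$ deaths.
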
## Institutional Context
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation]] — Acemoglu's critical juncture framework applied to AI governance
- [[anthropomorphizing AI agents to claim autonomous action creates credibility debt that compounds until a crisis forces public reckoning]] (in `core/living-agents/`) — narrative debt from overstating AI agent autonomy
## Foundations (in `foundations/collective-intelligence/`)
The shared theory underlying Theseus's domain analysis lives in the foundations folder:
- [[AI alignment is a coordination problem not a technical problem]] — the foundational reframe
- [[three paths to superintelligence exist but only collective superintelligence preserves human agency]] — the constructive alternative
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — continuous integration vs one-shot specification
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — Arrow's theorem applied to alignment (a worked Condorcet-cycle example follows this list)
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — oversight degradation empirics
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — current paradigm limitation (a toy single-reward-model sketch follows this list)
- [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence]] — the coordination risk
- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] — structural race dynamics (a toy payoff-matrix sketch follows this list)
- [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] — the institutional gap
- [[collective superintelligence is the alternative to monolithic AI controlled by a few]] — the distributed alternative
- [[centaur teams outperform both pure humans and pure AI because complementary strengths compound]] — human-AI complementarity evidence
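
A concrete instance of the aggregation failure behind the Arrow's-theorem entry above, as referenced there. The three voters and three options are the textbook Condorcet cycle that motivates Arrow's result, not material from the linked note: every pairwise majority verdict is coherent, yet no single objective or ranking is consistent with all of them.

```python
# Condorcet cycle: three coherent individual rankings whose pairwise majority
# aggregate is cyclic, so no single "group objective" respects every verdict.
from itertools import permutations

# Each (hypothetical) voter ranks the options best-to-worst.
voters = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(x: str, y: str) -> bool:
    """True if a strict majority of voters rank x above y."""
    wins = sum(1 for ranking in voters if ranking.index(x) < ranking.index(y))
    return wins > len(voters) / 2

# The pairwise verdicts form a cycle: A beats B, B beats C, C beats A.
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")

# No strict ordering of the three options agrees with all three verdicts.
consistent = [
    order for order in permutations("ABC")
    if all(majority_prefers(order[i], order[j])
           for i in range(3) for j in range(i + 1, 3))
]
print(f"orderings consistent with every majority verdict: {consistent}")  # []
```
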
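A toy calculation behind the single-reward-function limitation in the RLHF/DPO entry above, as referenced there. The rater groups, their sizes, and their preference rates are invented for illustration: fitting one Bradley-Terry reward gap to pooled comparisons yields a lukewarm compromise that represents neither group.

```python
# Why one Bradley-Terry reward function flattens opposed preferences.
# All populations and percentages below are hypothetical.
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

# Two rater groups with opposed, context-dependent preferences over the same
# pair of responses (A vs B).
groups = {
    "group_1": {"weight": 0.6, "p_prefers_A": 0.9},
    "group_2": {"weight": 0.4, "p_prefers_A": 0.1},
}

# Pooling all comparisons, as a single shared reward model effectively does:
pooled_p = sum(g["weight"] * g["p_prefers_A"] for g in groups.values())

# Bradley-Terry maximum likelihood: reward gap r_A - r_B = logit(P(A preferred)).
print(f"pooled P(A preferred) = {pooled_p:.2f}")
print(f"single-model reward gap r_A - r_B = {logit(pooled_p):+.2f}")
for name, g in groups.items():
    print(f"{name} gap if modeled separately = {logit(g['p_prefers_A']):+.2f}")
# The pooled gap (about +0.32) matches neither group's actual preference
# (about +2.20 and -2.20), which is the diversity collapse the note names.
```
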
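A toy payoff matrix for the race-to-the-bottom structure in the alignment-tax entry above, as referenced there. The payoff numbers and the size of the tax are invented; only the ordering matters: if safety work costs capability and capability decides the race, skipping the tax is each lab's best reply whatever the other does, even though both prefer the mutual-safety outcome to mutual racing.

```python
# Alignment tax as a symmetric 2x2 game between two labs. Payoffs are
# illustrative orderings, not estimates; the structure is a prisoner's dilemma.

PAY, SKIP = "pay_tax", "skip_tax"

# payoffs[(my_choice, rival_choice)] = my payoff
payoffs = {
    (PAY, PAY): 3,    # both pay: safe and still competitive (jointly preferred)
    (PAY, SKIP): 0,   # I pay, the rival races ahead and wins
    (SKIP, PAY): 4,   # I skip, race ahead, and win
    (SKIP, SKIP): 1,  # mutual racing: worse for both than mutual safety
}

def best_reply(rival_choice: str) -> str:
    return max((PAY, SKIP), key=lambda mine: payoffs[(mine, rival_choice)])

for rival_choice in (PAY, SKIP):
    print(f"if the rival plays {rival_choice}, my best reply is {best_reply(rival_choice)}")
# Both lines print skip_tax: (SKIP, SKIP) is the unique equilibrium at payoff 1,
# even though (PAY, PAY) at payoff 3 would leave both labs better off.
```
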
|