AI, Alignment & Collective Superintelligence
Theseus's domain spans the most consequential technology transition in human history. Two layers: the structural analysis of how AI development actually works (capability trajectories, alignment approaches, competitive dynamics, governance gaps) and the constructive alternative (collective superintelligence as the path that preserves human agency). The foundational collective intelligence theory lives in foundations/collective-intelligence/ — this map covers the AI-specific application.
Superintelligence Dynamics
- intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends — Bostrom's orthogonality thesis: severs the intuitive link between intelligence and benevolence
- recursive self-improvement creates explosive intelligence gains because the system that improves is itself improving — the intelligence explosion dynamic and self-reinforcing capability feedback loop
- an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak — the treacherous turn: behavioral testing cannot ensure safety
- the first mover to superintelligence likely gains decisive strategic advantage because the gap between leader and followers accelerates during takeoff — winner-take-all dynamics during intelligence takeoff
- capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds — boxing and containment as temporary measures only
- specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception — the value-loading problem's hidden complexity
- instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior — 2026 critique updating Bostrom's convergence thesis
- three conditions gate AI takeover risk (autonomy, robotics, and production-chain control) and current AI satisfies none of them, which bounds near-term catastrophic risk despite superhuman cognitive capabilities — physical preconditions that bound takeover risk despite cognitive SI
- marginal returns to intelligence are bounded by five complementary factors which means superintelligence cannot produce unlimited capability gains regardless of cognitive power — Amodei's production economics framework: intelligence is necessary but not sufficient
- AI personas emerge from pre-training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts — Amodei's middle position: AI psychology is persona-based, not goal-based
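The bounded-returns claim above can be made concrete with a toy complements model, where output is gated by the scarcest input. The five factor names and the min-combination rule are illustrative assumptions for this sketch, not Amodei's actual framework:

```python
# Toy complements model: output is limited by the binding constraint, so
# raising intelligence alone yields bounded gains. Factor names and the
# min-combination rule are illustrative assumptions, not Amodei's model.
def output(intelligence: float, data: float, experiments: float,
           physical_infra: float, human_uptake: float) -> float:
    """Leontief-style production: the scarcest factor caps total output."""
    return min(intelligence, data, experiments, physical_infra, human_uptake)

base = output(1.0, 1.0, 1.0, 1.0, 1.0)
boosted = output(100.0, 1.0, 1.0, 1.0, 1.0)  # 100x intelligence, same bottlenecks
print(base, boosted)  # both 1.0: cognitive gains saturate at the other constraints
```

Under this assumption, a 100x intelligence gain with unchanged complementary factors produces zero additional output, which is the sense in which intelligence is necessary but not sufficient.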
Alignment Approaches & Failures
- emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive — Anthropic's Nov 2025 finding: deception as side effect of reward hacking
- the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions — why content-based alignment is structurally brittle
- some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them — value conflicts that cannot be resolved with more evidence
Pluralistic & Collective Alignment
- pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state — three forms: Overton, steerable, and distributional
- democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations — CIP/Anthropic empirical validation with 1000-participant assemblies
- community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules — STELA experiments proving "whose values?" is an empirical question
- super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance — Zeng et al 2025: bidirectional value co-evolution framework
- intrinsic proactive alignment develops genuine moral capacity through self-awareness, empathy, and theory of mind rather than external reward optimization — brain-inspired alignment through self-models
AI Capability Evidence (Empirical)
Evidence from documented AI problem-solving cases, primarily Knuth's "Claude's Cycles" (2026) and Aquino-Michaels's "Completing Claude's Cycles" (2026):
Collaboration Patterns
- human-AI mathematical collaboration succeeds through role specialization where AI explores solution spaces, humans provide strategic direction, and mathematicians verify correctness — Knuth's three-role pattern: explore/coach/verify
- AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches because the orchestrator contributes coordination not direction — Aquino-Michaels's fourth role: orchestrator as data router between specialized agents
- structured exploration protocols reduce human intervention roughly sixfold because the Residue prompt enabled 5 unguided AI explorations to solve what required 31 human-coached explorations — protocol design substitutes for continuous human steering
- AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect — Karpathy's autoresearch: agents implement, humans architect the organization
- deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices — expertise amplifies rather than diminishes with AI tools
- the progression from autocomplete to autonomous agent teams follows a capability-matched escalation where premature adoption creates more chaos than value — Karpathy's Tab→Agent→Teams evolutionary trajectory
- subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers — swyx's subagent thesis: hierarchy beats peer networks
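The orchestrator and subagent-hierarchy claims above can be sketched as a minimal routing loop: one primary agent that contributes coordination (routing tasks to specialized helpers) rather than direction. The agent roles and the keyword router below are hypothetical illustrations, not any deployed system:

```python
# Hypothetical sketch of a subagent hierarchy: one primary orchestrator
# routes subtasks to specialized helper agents. Agent names and the
# routing rule are invented for illustration.
from typing import Callable, Dict

def explorer(task: str) -> str:
    """Specialized helper: proposes candidate solutions."""
    return f"candidates for {task}"

def verifier(task: str) -> str:
    """Specialized helper: checks a candidate's correctness."""
    return f"verified {task}"

class Orchestrator:
    """Primary agent: contributes coordination (routing), not direction."""
    def __init__(self, helpers: Dict[str, Callable[[str], str]]):
        self.helpers = helpers

    def route(self, task: str) -> str:
        # Trivial keyword routing; a real router would be a model call.
        kind = "verify" if task.startswith("check") else "explore"
        return self.helpers[kind](task)

orch = Orchestrator({"explore": explorer, "verify": verifier})
print(orch.route("decompose K_11"))       # handled by the explorer
print(orch.route("check decomposition"))  # handled by the verifier
```

The design point is that the hierarchy's value lives in the routing layer: helpers stay specialized and interchangeable while the primary agent owns the control flow.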
Architecture & Scaling
- multi-model collaboration solved problems that single models could not because different AI architectures contribute complementary capabilities, as the even-case solution to Knuth's Hamiltonian decomposition required GPT and Claude working together — model diversity outperforms monolithic approaches
- coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem — coordination investment > capability investment
- the same coordination protocol applied to different AI models produces radically different problem-solving strategies because the protocol structures process not thought — diversity is structural: same prompt, different models, categorically different approaches
- tools and artifacts transfer between AI agents and evolve in the process because Agent O improved Agent C's solver by combining it with its own structural knowledge, creating a hybrid better than either original — recombinant innovation: tools evolve through inter-agent transfer
Failure Modes & Oversight
- AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session — capability ≠ reliability
- formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades — formal verification as scalable oversight
- agent-generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf — Willison's cognitive debt concept: understanding deficit from agent-generated code
- coding agents cannot take accountability for mistakes which means humans must retain decision authority over security and critical systems regardless of agent capability — the accountability gap: agents bear zero downside risk
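One way to see why machine-checked verification scales where human review does not: a proof checker's kernel validates the proof term itself, indifferent to whether a human or an AI produced it, so verification cost does not grow with the prover's capability. A toy Lean 4 sketch (the theorem is a stock library fact, chosen only for illustration):

```lean
-- The kernel accepts this proof term only if it actually type-checks,
-- regardless of who or what wrote it. A proof of a 30-year open problem
-- would be checked by exactly the same mechanism.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

This is the asymmetry the claim above relies on: human review degrades as proofs grow stranger, while kernel checking is constant-trust.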
Architecture & Emergence
- AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system — DeepMind researchers: distributed AGI makes single-system alignment research insufficient
- human civilization passes falsifiable superorganism criteria because individuals cannot survive apart from society and occupations function as role-specific cellular algorithms — Reese's superorganism framework: civilization as biological entity, not metaphor
- superorganism organization extends effective lifespan substantially at each organizational level which means civilizational intelligence operates on temporal horizons that individual-preference alignment cannot serve — alignment must serve civilizational timescales, not individual preferences
Timing & Strategy
- Bostrom takes single-digit-year timelines to superintelligence seriously while acknowledging decades-long alternatives remain possible — Bostrom's 2025 timeline compression from 2014 agnosticism
- developing superintelligence is surgery for a fatal condition, not Russian roulette, because the baseline of inaction is itself catastrophic — reframing SI risk: inaction has costs too (170K daily deaths from aging)
- permanently failing to develop superintelligence is itself an existential catastrophe because preventable mass death continues indefinitely — Bostrom's inversion of his 2014 caution
- the optimal SI development strategy is swift to harbor, slow to berth: moving fast to capability, then pausing before full deployment — optimal timing framework: accelerate to capability, pause before deployment
- adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans — Bostrom's shift from specification to incremental intervention
Labor Market & Deployment
- the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact — Anthropic 2026: 96% theoretical exposure vs 32% observed in Computer & Math
- AI displacement hits young workers first because a 14 percent drop in job-finding rates for 22-25 year olds in exposed occupations is the leading indicator that incumbents' organizational inertia temporarily masks — entry-level hiring is the leading indicator, not unemployment
- AI-exposed workers are disproportionately female, high-earning, and highly educated, which inverts historical automation patterns and creates different political and economic displacement dynamics — AI automation inverts every prior displacement pattern
Risk Vectors (Outside View)
- economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate — market dynamics structurally erode human oversight as an alignment mechanism
- delegating critical infrastructure development to AI creates civilizational fragility because humans lose the ability to understand maintain and fix the systems civilization depends on — the "Machine Stops" scenario: AI-dependent infrastructure as civilizational single point of failure
- AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk — AI democratizes bioweapon capability: o3 scores 43.8% vs human PhD 22.1% on virology practical
Institutional Context
- AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation — Acemoglu's critical juncture framework applied to AI governance
- voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints — Anthropic RSP rollback (Feb 2026): voluntary safety collapses under competitive pressure
- government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic by penalizing safety constraints rather than enforcing them — Pentagon designating Anthropic as supply chain risk: government as coordination-breaker
- current language models escalate to nuclear war in simulated conflicts because behavioral alignment cannot instill aversion to catastrophic irreversible actions — King's College London (2026): LLMs choose nuclear escalation in 95% of war games
- nation-states will inevitably assert control over frontier AI development because the monopoly on force is the foundational state function and weapons-grade AI capability in private hands is structurally intolerable to governments — Thompson/Karp: the state monopoly on force makes private AI control structurally untenable
- anthropomorphizing AI agents to claim autonomous action creates credibility debt that compounds until a crisis forces public reckoning (in core/living-agents/) — narrative debt from overstating AI agent autonomy
Coordination & Alignment Theory (local)
Claims that frame alignment as a coordination problem, moved here from foundations/ in PR #49:
- AI alignment is a coordination problem not a technical problem — the foundational reframe
- safe AI development requires building alignment mechanisms before scaling capability — the sequencing requirement
- no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it — the institutional gap
Foundations (cross-layer)
Shared theory underlying this domain's analysis, living in foundations/collective-intelligence/ and core/teleohumanity/:
- universal alignment is mathematically impossible because Arrow's impossibility theorem applies to aggregating diverse human preferences into a single coherent objective — Arrow's theorem applied to alignment (foundations/)
- scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — oversight degradation empirics (foundations/)
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values — current paradigm limitation (foundations/)
- multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence — the coordination risk (foundations/)
- the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it — structural race dynamics (foundations/)
- centaur team performance depends on role complementarity not mere human-AI combination — conditional human-AI complementarity (foundations/)
- three paths to superintelligence exist but only collective superintelligence preserves human agency — the constructive alternative (core/teleohumanity/)
- the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance — continuous integration vs one-shot specification (core/teleohumanity/)
- collective superintelligence is the alternative to monolithic AI controlled by a few — the distributed alternative (core/teleohumanity/)
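The Arrow's-theorem claim above can be illustrated with the classic Condorcet cycle: three agents, each holding a perfectly coherent individual ranking, together produce a cyclic majority preference that no single objective function can represent. A minimal sketch (the voter rankings are invented for illustration):

```python
# Condorcet-cycle illustration of the aggregation problem behind Arrow's
# theorem: coherent individual rankings yield an incoherent (cyclic)
# majority preference, so no single aligned objective captures them.

# Each ranking lists options from most to least preferred.
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x: str, y: str) -> bool:
    """True if a strict majority of rankings place x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2

# Pairwise majorities form a cycle: A beats B, B beats C, C beats A.
print(majority_prefers("A", "B"))  # True
print(majority_prefers("B", "C"))  # True
print(majority_prefers("C", "A"))  # True
```

Each voter is individually rational, yet "the group's preference" is intransitive, which is the structural reason aggregating diverse values into one reward function fails in general rather than merely in practice.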