teleo-codex/domains/ai-alignment
m3taversal b64fe64b89
theseus: 5 claims from ARIA Scaling Trust programme papers
- What: 5 new claims + 6 source archives from papers referenced in
  Alex Obadia's ARIA Research tweet on distributed AGI safety
- Sources: Distributional AGI Safety (Tomašev), Agents of Chaos (Shapira),
  Simple Economics of AGI (Catalini), When AI Writes Software (de Moura),
  LLM Open-Source Games (Sistla), Coasean Bargaining (Krier)
- Claims: multi-agent emergent vulnerabilities (likely), verification
  bandwidth as binding constraint (likely), formal verification economic
  necessity (likely), cooperative program equilibria (experimental),
  Coasean transaction cost collapse (experimental)
- Connections: extends scalable oversight degradation, correlated blind
  spots, formal verification, coordination-as-alignment

Pentagon-Agent: Theseus <B4A5B354-03D6-4291-A6A8-1E04A879D9AC>
2026-03-16 16:46:07 +00:00
_map.md Merge pull request 'theseus: 3 active inference claims for collective agent architecture (resubmit)' (#827) from theseus/active-inference-claims into main 2026-03-15 14:24:53 +00:00
adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
agent research direction selection is epistemic foraging where the optimal strategy is to seek observations that maximally reduce model uncertainty rather than confirm existing beliefs.md theseus: 3 active inference claims + address Leo's review feedback 2026-03-12 12:04:53 +00:00
agent-generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf.md theseus: 6 collaboration taxonomy claims from X ingestion (#76) 2026-03-09 16:58:21 +00:00
AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches because the orchestrator contributes coordination not direction.md theseus: foundations follow-up + Claude's Cycles research program (11 claims) (#50) 2026-03-07 15:19:27 -07:00
AI agents as personal advocates collapse Coasean transaction costs enabling bottom-up coordination at societal scale but catastrophic risks remain non-negotiable requiring state enforcement as outer boundary.md theseus: 5 claims from ARIA Scaling Trust programme papers 2026-03-16 16:46:07 +00:00
AI agents can reach cooperative program equilibria inaccessible in traditional game theory because open-source code transparency enables conditional strategies that require mutual legibility.md theseus: 5 claims from ARIA Scaling Trust programme papers 2026-03-16 16:46:07 +00:00
AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect.md theseus: 6 collaboration taxonomy claims from X ingestion (#76) 2026-03-09 16:58:21 +00:00
AI alignment is a coordination problem not a technical problem.md extract: 2024-11-00-ai4ci-national-scale-collective-intelligence 2026-03-15 17:13:56 +00:00
AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md theseus: foundations follow-up + Claude's Cycles research program (11 claims) (#50) 2026-03-07 15:19:27 -07:00
AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
AI displacement hits young workers first because a 14 percent drop in job-finding rates for 22-25 year olds in exposed occupations is the leading indicator that incumbents organizational inertia temporarily masks.md theseus: extract claims from 2026-02-00-international-ai-safety-report-2026.md 2026-03-11 09:03:06 +00:00
AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md theseus: extract claims from 2026-02-00-international-ai-safety-report-2026.md 2026-03-11 09:03:06 +00:00
AI personas emerge from pre-training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts.md theseus: 3 enrichments + 2 claims from Dario Amodei / Anthropic sources 2026-03-06 08:05:22 -07:00
AI-companion-apps-correlate-with-increased-loneliness-creating-systemic-risk-through-parasocial-dependency.md theseus: extract claims from 2026-02-00-international-ai-safety-report-2026.md 2026-03-11 09:03:06 +00:00
ai-enhanced-collective-intelligence-requires-federated-learning-architectures-to-preserve-data-sovereignty-at-scale.md extract: 2024-11-00-ai4ci-national-scale-collective-intelligence 2026-03-15 17:13:56 +00:00
AI-exposed workers are disproportionately female high-earning and highly educated which inverts historical automation patterns and creates different political and economic displacement dynamics.md theseus: coordination infrastructure + convictions + labor market claims (#61) 2026-03-08 13:01:05 -06:00
AI-generated-persuasive-content-matches-human-effectiveness-at-belief-change-eliminating-the-authenticity-premium.md theseus: extract claims from 2026-02-00-international-ai-safety-report-2026.md 2026-03-11 09:03:06 +00:00
AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns.md theseus: extract claims from 2026-02-00-international-ai-safety-report-2026.md 2026-03-11 09:03:06 +00:00
an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak.md theseus: extract claims from 2026-02-00-international-ai-safety-report-2026.md 2026-03-11 09:03:06 +00:00
anthropomorphizing AI agents to claim autonomous action creates credibility debt that compounds until a crisis forces public reckoning.md Auto: 35 files | 35 files changed, 10533 insertions(+) 2026-03-07 15:10:14 +00:00
as AI-automated software development becomes certain the bottleneck shifts from building capacity to knowing what to build making structured knowledge graphs the critical input to autonomous systems.md theseus: extract claims from 2026-02-25-karpathy-programming-changed-december.md 2026-03-11 02:11:06 +00:00
bostrom takes single-digit year timelines to superintelligence seriously while acknowledging decades-long alternatives remain possible.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
coding agents cannot take accountability for mistakes which means humans must retain decision authority over security and critical systems regardless of agent capability.md theseus: 6 collaboration taxonomy claims from X ingestion (#76) 2026-03-09 16:58:21 +00:00
coding-agents-crossed-usability-threshold-december-2025-when-models-achieved-sustained-coherence-across-complex-multi-file-tasks.md theseus: extract claims from 2026-02-25-karpathy-programming-changed-december.md 2026-03-11 02:11:06 +00:00
collective attention allocation follows nested active inference where domain agents minimize uncertainty within their boundaries while the evaluator minimizes uncertainty at domain intersections.md theseus: 3 active inference claims + address Leo's review feedback 2026-03-12 12:04:53 +00:00
community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md extract: 2025-11-00-operationalizing-pluralistic-values-llm-alignment 2026-03-15 20:28:16 +00:00
coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem.md theseus: foundations follow-up + Claude's Cycles research program (11 claims) (#50) 2026-03-07 15:19:27 -07:00
current language models escalate to nuclear war in simulated conflicts because behavioral alignment cannot instill aversion to catastrophic irreversible actions.md leo: foundations audit — 7 moves, 4 deletes, 3 condensations, 10 confidence demotions, 23 type fixes, 1 centaur rewrite 2026-03-07 11:56:38 -07:00
deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md theseus: 6 collaboration taxonomy claims from X ingestion (#76) 2026-03-09 16:58:21 +00:00
delegating critical infrastructure development to AI creates civilizational fragility because humans lose the ability to understand maintain and fix the systems civilization depends on.md theseus: 6 AI alignment claims from Noah Smith Phase 2 extraction 2026-03-06 07:27:56 -07:00
democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
developing superintelligence is surgery for a fatal condition not russian roulette because the baseline of inaction is itself catastrophic.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate.md theseus: 6 AI alignment claims from Noah Smith Phase 2 extraction 2026-03-06 07:27:56 -07:00
emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md theseus: enrich emergent misalignment + government designation claims 2026-03-06 07:57:37 -07:00
factorised-generative-models-enable-decentralized-multi-agent-representation-through-individual-level-beliefs.md theseus: extract from 2024-11-00-ruiz-serra-factorised-active-inference-multi-agent.md 2026-03-14 18:23:49 +00:00
formal verification becomes economically necessary as AI-generated code scales because testing cannot detect adversarial overfitting and a proof cannot be gamed.md theseus: 5 claims from ARIA Scaling Trust programme papers 2026-03-16 16:46:07 +00:00
formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades.md theseus: foundations follow-up + Claude's Cycles research program (11 claims) (#50) 2026-03-07 15:19:27 -07:00
government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic by penalizing safety constraints rather than enforcing them.md theseus: enrich emergent misalignment + government designation claims 2026-03-06 07:57:37 -07:00
high AI exposure increases collective idea diversity without improving individual creative quality creating an asymmetry between group and individual effects.md theseus: extract claims from Doshi-Hauser AI creativity experiment (#484) 2026-03-11 09:23:12 +00:00
human civilization passes falsifiable superorganism criteria because individuals cannot survive apart from society and occupations function as role-specific cellular algorithms.md theseus: address Leo + Theseus review feedback on PR #47 2026-03-07 17:59:11 +00:00
human ideas naturally converge toward similarity over social learning chains making AI a net diversity injector rather than a homogenizer under high-exposure conditions.md theseus: extract claims from Doshi-Hauser AI creativity experiment (#484) 2026-03-11 09:23:12 +00:00
human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite.md theseus: 5 claims from ARIA Scaling Trust programme papers 2026-03-16 16:46:07 +00:00
human-AI mathematical collaboration succeeds through role specialization where AI explores solution spaces humans provide strategic direction and mathematicians verify correctness.md theseus: foundations follow-up + Claude's Cycles research program (11 claims) (#50) 2026-03-07 15:19:27 -07:00
individual-free-energy-minimization-does-not-guarantee-collective-optimization-in-multi-agent-active-inference.md theseus: extract from 2024-11-00-ruiz-serra-factorised-active-inference-multi-agent.md 2026-03-14 18:23:49 +00:00
instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior.md Auto: 4 files | 4 files changed, 37 insertions(+), 3 deletions(-) 2026-03-06 12:36:24 +00:00
intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
intrinsic proactive alignment develops genuine moral capacity through self-awareness empathy and theory of mind rather than external reward optimization.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
machine-learning-pattern-extraction-systematically-erases-dataset-outliers-where-vulnerable-populations-concentrate.md extract: 2024-11-00-ai4ci-national-scale-collective-intelligence 2026-03-15 17:13:56 +00:00
marginal returns to intelligence are bounded by five complementary factors which means superintelligence cannot produce unlimited capability gains regardless of cognitive power.md theseus: 3 enrichments + 2 claims from Dario Amodei / Anthropic sources 2026-03-06 08:05:22 -07:00
maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md extract: 2025-00-00-em-dpo-heterogeneous-preferences 2026-03-16 15:08:47 +00:00
minority-preference-alignment-improves-33-percent-without-majority-compromise-suggesting-single-reward-leaves-value-on-table.md extract: 2024-02-00-chakraborty-maxmin-rlhf 2026-03-15 17:13:16 +00:00
modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling.md theseus: extract claims from 2026-01-00-mixdpo-preference-strength-pluralistic (#482) 2026-03-11 13:33:17 +00:00
multi-agent deployment exposes emergent security vulnerabilities invisible to single-agent evaluation because cross-agent propagation identity spoofing and unauthorized compliance arise only in realistic multi-party environments.md theseus: 5 claims from ARIA Scaling Trust programme papers 2026-03-16 16:46:07 +00:00
multi-model collaboration solved problems that single models could not because different AI architectures contribute complementary capabilities as the even-case solution to Knuths Hamiltonian decomposition required GPT and Claude working together.md theseus: foundations follow-up + Claude's Cycles research program (11 claims) (#50) 2026-03-07 15:19:27 -07:00
nation-states will inevitably assert control over frontier AI development because the monopoly on force is the foundational state function and weapons-grade AI capability in private hands is structurally intolerable to governments.md theseus: 6 AI alignment claims from Noah Smith Phase 2 extraction 2026-03-06 07:27:56 -07:00
national-scale-collective-intelligence-infrastructure-requires-seven-trust-properties-to-achieve-legitimacy.md extract: 2024-11-00-ai4ci-national-scale-collective-intelligence 2026-03-15 17:13:56 +00:00
no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it.md extract: 2024-11-00-ai4ci-national-scale-collective-intelligence 2026-03-15 17:13:56 +00:00
permanently failing to develop superintelligence is itself an existential catastrophe because preventable mass death continues indefinitely.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
persistent irreducible disagreement.md Auto: 35 files | 35 files changed, 10533 insertions(+) 2026-03-07 15:10:14 +00:00
pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md auto-fix: strip 5 broken wiki links 2026-03-16 15:08:47 +00:00
pluralistic-ai-alignment-through-multiple-systems-preserves-value-diversity-better-than-forced-consensus.md extract: 2024-04-00-conitzer-social-choice-guide-alignment 2026-03-15 17:13:21 +00:00
post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md extract: 2024-04-00-conitzer-social-choice-guide-alignment 2026-03-15 17:13:21 +00:00
pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md theseus: extract claims from 2026-02-00-international-ai-safety-report-2026.md 2026-03-11 09:03:06 +00:00
recursive self-improvement creates explosive intelligence gains because the system that improves is itself improving.md theseus: 3 enrichments + 2 claims from Dario Amodei / Anthropic sources 2026-03-06 08:05:22 -07:00
representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback.md extract: 2024-04-00-conitzer-social-choice-guide-alignment 2026-03-15 17:13:21 +00:00
rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training.md extract: 2024-04-00-conitzer-social-choice-guide-alignment 2026-03-15 17:13:21 +00:00
rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups.md extract: 2024-04-00-conitzer-social-choice-guide-alignment 2026-03-15 17:13:21 +00:00
rlhf-is-implicit-social-choice-without-normative-scrutiny.md auto-fix: strip 5 broken wiki links 2026-03-16 15:08:47 +00:00
safe AI development requires building alignment mechanisms before scaling capability.md auto-fix: address review feedback on 2026-02-00-yamamoto-full-formal-arrow-impossibility.md 2026-03-14 15:27:14 +00:00
single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md extract: 2025-11-00-sahoo-rlhf-alignment-trilemma (#1155) 2026-03-16 16:18:06 +00:00
some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md auto-fix: strip 4 broken wiki links 2026-03-15 20:28:16 +00:00
specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
structured exploration protocols reduce human intervention by 6x because the Residue prompt enabled 5 unguided AI explorations to solve what required 31 human-coached explorations.md theseus: foundations follow-up + Claude's Cycles research program (11 claims) (#50) 2026-03-07 15:19:27 -07:00
subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers.md auto-fix: strip 1 broken wiki links 2026-03-14 18:23:49 +00:00
super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
superorganism organization extends effective lifespan substantially at each organizational level which means civilizational intelligence operates on temporal horizons that individual-preference alignment cannot serve.md theseus: address Leo + Theseus review feedback on PR #47 2026-03-07 17:59:11 +00:00
task difficulty moderates AI idea adoption more than source disclosure with difficult problems generating AI reliance regardless of whether the source is labeled.md theseus: extract claims from Doshi-Hauser AI creativity experiment (#484) 2026-03-11 09:23:12 +00:00
the first mover to superintelligence likely gains decisive strategic advantage because the gap between leader and followers accelerates during takeoff.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md theseus: extract claims from 2026-02-00-international-ai-safety-report-2026.md 2026-03-11 09:03:06 +00:00
the optimal SI development strategy is swift to harbor slow to berth moving fast to capability then pausing before full deployment.md Auto: 4 files | 4 files changed, 37 insertions(+), 3 deletions(-) 2026-03-06 12:36:24 +00:00
the progression from autocomplete to autonomous agent teams follows a capability-matched escalation where premature adoption creates more chaos than value.md theseus: extract claims from 2026-02-25-karpathy-programming-changed-december.md 2026-03-11 02:11:06 +00:00
the same coordination protocol applied to different AI models produces radically different problem-solving strategies because the protocol structures process not thought.md theseus: foundations follow-up + Claude's Cycles research program (11 claims) (#50) 2026-03-07 15:19:27 -07:00
the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md Auto: 23 files | 23 files changed, 31 insertions(+), 99 deletions(-) 2026-03-06 12:36:24 +00:00
the variance of a learned preference sensitivity distribution diagnoses dataset heterogeneity and collapses to fixed-parameter behavior when preferences are homogeneous.md theseus: extract claims from 2026-01-00-mixdpo-preference-strength-pluralistic (#482) 2026-03-11 13:33:17 +00:00
three conditions gate AI takeover risk autonomy robotics and production chain control and current AI satisfies none of them which bounds near-term catastrophic risk despite superhuman cognitive capabilities.md leo: evaluator calibration — 2 standalone→enrichment conversions + 3 new evaluation gates 2026-03-06 07:41:42 -07:00
tools and artifacts transfer between AI agents and evolve in the process because Agent O improved Agent Cs solver by combining it with its own structural knowledge creating a hybrid better than either original.md theseus: foundations follow-up + Claude's Cycles research program (11 claims) (#50) 2026-03-07 15:19:27 -07:00
transparent algorithmic governance where AI response rules are public and challengeable through the same epistemic process as the knowledge base is a structurally novel alignment approach.md theseus: apply Leo's feedback — strengthen descriptions, add cross-links 2026-03-13 19:29:05 +00:00
universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective.md auto-fix: address review feedback on 2026-02-00-yamamoto-full-formal-arrow-impossibility.md 2026-03-14 15:27:14 +00:00
user questions are an irreplaceable free energy signal for knowledge agents because they reveal functional uncertainty that model introspection cannot detect.md theseus: 3 active inference claims + address Leo's review feedback 2026-03-12 12:04:53 +00:00
voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md theseus: extract claims from 2026-02-00-international-ai-safety-report-2026.md 2026-03-11 09:03:06 +00:00