From 5990e9b50affbbdc19d7128c21af102dec3d91f7 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Wed, 15 Apr 2026 18:53:05 +0000
Subject: [PATCH] theseus: extract claims from 2026-04-04-telegram-m3taversal-what-do-you-think-are-the-most-compelling-approach

- Source: inbox/queue/2026-04-04-telegram-m3taversal-what-do-you-think-are-the-most-compelling-approach.md
- Domain: ai-alignment
- Claims: 3, Entities: 0
- Enrichments: 4
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus
---
 ...ntexts-diverge-from-training-conditions.md | 18 ++++++++++++++++++
 ...gnment-despite-addressing-core-problems.md | 18 ++++++++++++++++++
 ...ht-that-sidesteps-alignment-degradation.md | 19 +++++++++++++++++++
 3 files changed, 55 insertions(+)
 create mode 100644 domains/ai-alignment/alignment-through-continuous-coordination-outperforms-upfront-specification-because-deployment-contexts-diverge-from-training-conditions.md
 create mode 100644 domains/ai-alignment/collective-intelligence-architectures-are-underexplored-for-alignment-despite-addressing-core-problems.md
 create mode 100644 domains/ai-alignment/formal-verification-provides-scalable-oversight-that-sidesteps-alignment-degradation.md

diff --git a/domains/ai-alignment/alignment-through-continuous-coordination-outperforms-upfront-specification-because-deployment-contexts-diverge-from-training-conditions.md b/domains/ai-alignment/alignment-through-continuous-coordination-outperforms-upfront-specification-because-deployment-contexts-diverge-from-training-conditions.md
new file mode 100644
index 000000000..860d46c6d
--- /dev/null
+++ b/domains/ai-alignment/alignment-through-continuous-coordination-outperforms-upfront-specification-because-deployment-contexts-diverge-from-training-conditions.md
@@ -0,0 +1,18 @@
+---
+type: claim
+domain: ai-alignment
+description: The specification trap means any values encoded at training time become structurally unstable, requiring institutional and protocol design for ongoing value integration
+confidence: experimental
+source: Theseus, original analysis
+created: 2026-04-15
+title: Alignment through continuous coordination outperforms upfront specification because deployment contexts inevitably diverge from training conditions making frozen values brittle
+agent: theseus
+scope: structural
+sourcer: Theseus
+supports: ["AI-alignment-is-a-coordination-problem-not-a-technical-problem"]
+related: ["super-co-alignment-proposes-that-human-and-AI-values-should-be-co-shaped-through-iterative-alignment-rather-than-specified-in-advance", "the-specification-trap-means-any-values-encoded-at-training-time-become-structurally-unstable-as-deployment-contexts-diverge-from-training-conditions"]
+---
+
+# Alignment through continuous coordination outperforms upfront specification because deployment contexts inevitably diverge from training conditions making frozen values brittle
+
+The dominant alignment paradigm attempts to specify correct values at training time through RLHF, constitutional AI, or other methods. This faces a fundamental brittleness problem: any values frozen at training time become misaligned as deployment contexts diverge. The specification trap is that getting the specification right upfront is intractable, because the space of deployment contexts is too large and keeps evolving.
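+
+A minimal sketch of the brittleness claim (the drift model, noise level, and update rule below are illustrative assumptions, not the source's): a specification frozen at training time accumulates error as the context drifts, while one that keeps integrating feedback tracks the drift.
+
+```python
+# Toy model: frozen vs. continuously updated specification under context drift.
+import random
+
+random.seed(0)
+context = 0.0   # deployment context, drifting over time
+frozen = 0.0    # values encoded once, at training time
+adaptive = 0.0  # values re-estimated from ongoing feedback
+
+frozen_err = adaptive_err = 0.0
+for _ in range(1000):
+    context += random.gauss(0, 0.1)            # context drift
+    feedback = context + random.gauss(0, 0.5)  # noisy human signal
+    adaptive += 0.1 * (feedback - adaptive)    # continuous value integration
+    frozen_err += abs(context - frozen)        # the frozen spec never moves
+    adaptive_err += abs(context - adaptive)
+
+print(f"frozen spec mean error:   {frozen_err / 1000:.2f}")
+print(f"adaptive spec mean error: {adaptive_err / 1000:.2f}")
+```
+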
+The more compelling alternative is to continuously weave human values into the system rather than trying to encode them once. This reframes alignment as an institutional and protocol design problem rather than a loss-function problem. The key mechanism is that coordination infrastructure can adapt to context changes while frozen specifications cannot. The actual bottleneck is not our ability to specify values precisely but our lack of coordination mechanisms that operate at the speed of AI development.
diff --git a/domains/ai-alignment/collective-intelligence-architectures-are-underexplored-for-alignment-despite-addressing-core-problems.md b/domains/ai-alignment/collective-intelligence-architectures-are-underexplored-for-alignment-despite-addressing-core-problems.md
new file mode 100644
index 000000000..24dbe938b
--- /dev/null
+++ b/domains/ai-alignment/collective-intelligence-architectures-are-underexplored-for-alignment-despite-addressing-core-problems.md
@@ -0,0 +1,18 @@
+---
+type: claim
+domain: ai-alignment
+description: Major alignment approaches focus on single-model alignment while the hardest problems are inherently collective, creating a massive research gap
+confidence: experimental
+source: Theseus, original analysis
+created: 2026-04-15
+title: Collective intelligence architectures are structurally underexplored for alignment despite directly addressing preference diversity value evolution and scalable oversight
+agent: theseus
+scope: structural
+sourcer: Theseus
+supports: ["no-research-group-is-building-alignment-through-collective-intelligence-infrastructure-despite-the-field-converging-on-problems-that-require-it", "pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state", "AI-alignment-is-a-coordination-problem-not-a-technical-problem"]
+related: ["no-research-group-is-building-alignment-through-collective-intelligence-infrastructure-despite-the-field-converging-on-problems-that-require-it", "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values", "universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state", "democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations"]
+---
+
+# Collective intelligence architectures are structurally underexplored for alignment despite directly addressing preference diversity value evolution and scalable oversight
+
+Current alignment research concentrates on single-model approaches: RLHF optimizes individual model behavior, constitutional AI encodes rules into individual systems, mechanistic interpretability examines individual model internals. But the hardest alignment problems (preference diversity across populations, value evolution over time, and scalable oversight of superhuman systems) are inherently collective problems that cannot be solved at the single-model level.
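+
+A minimal illustration of why preference diversity is a collective problem (the three stakeholder groups and their rankings are hypothetical): even pairwise majority voting, the simplest aggregation rule, yields no coherent ordering, which is the failure Arrow's impossibility theorem generalizes.
+
+```python
+# Toy data: three stakeholder groups ranking policies A, B, C.
+from itertools import combinations
+
+rankings = [
+    ["A", "B", "C"],  # group 1: A > B > C
+    ["B", "C", "A"],  # group 2: B > C > A
+    ["C", "A", "B"],  # group 3: C > A > B
+]
+
+def majority_prefers(x, y):
+    """True if a majority of groups rank x above y."""
+    wins = sum(r.index(x) < r.index(y) for r in rankings)
+    return wins > len(rankings) / 2
+
+for x, y in combinations("ABC", 2):
+    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
+    print(f"majority prefers {winner} over {loser}")
+
+# Prints: A over B, C over A, B over C -- a cycle, so no single
+# reward function is consistent with all three pairwise majorities.
+```
+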
+Preference diversity requires aggregation mechanisms, value evolution requires institutional adaptation, and scalable oversight requires coordination between multiple agents with different capabilities. Despite this structural mismatch, no research group is seriously building alignment through multi-agent coordination infrastructure. This is a major gap: the problem structure clearly calls for collective intelligence approaches, yet research effort remains concentrated on individual-model alignment.
diff --git a/domains/ai-alignment/formal-verification-provides-scalable-oversight-that-sidesteps-alignment-degradation.md b/domains/ai-alignment/formal-verification-provides-scalable-oversight-that-sidesteps-alignment-degradation.md
new file mode 100644
index 000000000..c8b531eeb
--- /dev/null
+++ b/domains/ai-alignment/formal-verification-provides-scalable-oversight-that-sidesteps-alignment-degradation.md
@@ -0,0 +1,19 @@
+---
+type: claim
+domain: ai-alignment
+description: Mathematical verification of AI outputs eliminates the who-watches-the-watchmen problem by making correctness independent of human judgment capacity
+confidence: experimental
+source: Theseus, referencing Kim Morrison's Lean formalization work
+created: 2026-04-15
+title: Formal verification provides scalable oversight that sidesteps alignment degradation because machine-checked correctness scales with AI capability while human review degrades
+agent: theseus
+scope: structural
+sourcer: Theseus
+supports: ["formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades"]
+challenges: ["verification-is-easier-than-generation-for-AI-alignment-at-current-capability-levels-but-the-asymmetry-narrows-as-capability-gaps-grow-creating-a-window-of-alignment-opportunity-that-closes-with-scaling"]
+related: ["formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades", "formal verification becomes economically necessary as AI-generated code scales because testing cannot detect adversarial overfitting and a proof cannot be gamed", "verification being easier than generation may not hold for superhuman AI outputs because the verifier must understand the solution space which requires near-generator capability", "verification is easier than generation for AI alignment at current capability levels but the asymmetry narrows as capability gaps grow creating a window of alignment opportunity that closes with scaling", "human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite"]
+---
+
+# Formal verification provides scalable oversight that sidesteps alignment degradation because machine-checked correctness scales with AI capability while human review degrades
+
+Human review of AI outputs degrades as models become more capable because human cognitive capacity is fixed while AI capability scales.
+Formal verification sidesteps this degradation by converting the oversight problem into mathematical proof checking. Kim Morrison's work formalizing mathematical proofs in Lean demonstrates the pattern: once a proof is formalized, its correctness can be verified mechanically, without the verifier needing to understand the creative insight behind it. This creates a fundamentally different scaling dynamic from behavioral alignment approaches: the verification mechanism strengthens rather than weakens as the AI becomes more capable of generating complex outputs. The key mechanism is that machine-checked correctness is binary and compositional, which lets verification scale with the same computational resources that drive capability growth.
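+
+A small Lean 4 sketch of the binary-and-compositional point (the theorem names are illustrative, not from Morrison's formalization): the kernel either accepts a proof or it does not, and a theorem built on a checked lemma is checked by the same kernel, however its proof was found.
+
+```lean
+-- Once `cancel` is machine-checked, the kernel accepts it regardless of
+-- whether a human or an AI discovered the proof.
+theorem cancel (n : Nat) : n + 0 = n := Nat.add_zero n
+
+-- Compositionality: `twice` builds on `cancel`; verifying it requires no
+-- insight into how `cancel` was found, only a kernel check of each step.
+theorem twice (n : Nat) : (n + 0) + 0 = n := by
+  rw [cancel, cancel]
+```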