theseus: extract 3 multi-agent orchestration claims + enrich subagent hierarchy

- What: 3 new claims from Madaan et al. (Google DeepMind/MIT) research + synthesis:
  1. Multi-agent coordination improves parallel tasks but degrades sequential reasoning
  2. AI integration follows an inverted-U with systematic overshoot incentives
  3. Iterative self-improvement compounds when evaluation separated from generation
- Enrichment: Scoped subagent hierarchy claim with Madaan et al. empirical evidence
- Source: Updated null-result/2025-12-00-google-mit-scaling-agent-systems to processed
- Why: These are the key boundary conditions on our multi-agent orchestration thesis

Pentagon-Agent: Theseus <24DE7DA0-E4D5-4023-B1A2-3F736AFF4EEE>
m3taversal 2026-03-28 19:54:54 +00:00 committed by Teleo Agents
parent e539343bd7
commit efaae04957
5 changed files with 159 additions and 1 deletion


@@ -0,0 +1,50 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence, mechanisms]
description: "Four structural forces — perception gaps, competitive pressure, deskilling drift, and verification tax ignorance — push AI adoption past the performance peak where human-AI combinations degrade below either alone"
confidence: experimental
source: "Synthesis across Dell'Acqua et al. (Harvard/BCG, 2023), Noy & Zhang (Science, 2023), Brynjolfsson et al. (Stanford/NBER, 2023), and Nature meta-analysis of human-AI performance (2024-2025)"
created: 2026-03-28
depends_on:
- "human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite"
---
# AI integration follows an inverted-U where economic incentives systematically push organizations past the optimal human-AI ratio
The evidence across multiple studies converges on a pattern: human-AI collaboration follows an inverted-U curve where moderate integration improves performance, but deeper integration degrades it — and organizations systematically overshoot the optimum.
The Nature meta-analysis found that human-AI combinations perform worse on average than either humans or AI alone, across many task types. This is not because AI is bad or humans are bad — it's because the combination introduces coordination costs (verification, handoff, context switching) that exceed the complementarity benefits when pushed too far.
Dell'Acqua et al. (Harvard/BCG, 2023) demonstrated a "jagged frontier" where consultants using AI outperformed on tasks within AI capability but underperformed on tasks at the frontier — and crucially, consultants couldn't reliably distinguish which tasks were which. This perception gap is structural: the better AI gets, the harder it becomes to identify where it fails, because failures look increasingly plausible.
Four forces push organizations past the optimal point:
1. **Perception gaps** — Decision-makers overestimate AI reliability because AI failures are plausible-looking. The better the model, the harder to spot errors, creating a false confidence gradient.
2. **Competitive pressure** — Organizations that adopt less AI appear to fall behind on visible metrics (speed, cost), even if their quality is higher. The metrics that matter (accuracy on edge cases, long-term reliability) are lagging indicators.
3. **Deskilling drift** — As humans rely more on AI, their independent judgment atrophies. Brynjolfsson et al. showed productivity gains from AI-assisted customer service, but the mechanism was that AI helped low-skill workers perform like high-skill workers — it didn't improve high-skill workers. Over time, the system produces more medium-skill workers and fewer high-skill ones, reducing the human verification capacity the system depends on.
4. **Verification tax ignorance** — The cost of verifying AI output scales with output volume but is invisible in standard productivity metrics. An organization that 10x's its AI-generated output without 10x-ing its verification capacity has degraded quality in ways that only show up downstream.
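The verification tax is simple arithmetic. A toy calculation (all numbers are illustrative, not from the cited studies) shows why scaling generation without scaling verification degrades quality silently:

```python
# Toy arithmetic for the verification tax: generation scales 10x but human
# verification capacity stays fixed, so the unverified share of output explodes.
generated_per_day = 100
verified_per_day = 80                      # fixed human verification capacity

unverified_share = (generated_per_day - verified_per_day) / generated_per_day
# 20% of output ships unverified

generated_per_day *= 10                    # 10x the AI output, same verifiers
unverified_share = (generated_per_day - verified_per_day) / generated_per_day
# now 92% ships unverified, and the degradation only shows up downstream
```

The unverified share is invisible in throughput metrics, which is exactly why it is ignored.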
This matters for any multi-agent system (including ours): the optimal number of agents is not "as many as possible" — it's the point where marginal agent contribution exceeds marginal coordination and verification cost. The inverted-U predicts that scaling agents past this point actively degrades the knowledge base, and the four forces predict we'll be tempted to do it anyway.
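The marginal framing can be made concrete with a toy model (the functional forms and constants are assumptions for illustration, not fits to the cited studies): benefit from added agents shows diminishing returns while coordination and verification cost grows super-linearly, which produces an interior optimum rather than "more is better":

```python
import math

def net_value(n: int) -> float:
    """Toy net value of an n-agent system: diminishing marginal benefit
    minus super-linearly growing coordination/verification cost.
    Constants are illustrative assumptions, not empirical fits."""
    benefit = 10.0 * math.log(1 + n)    # each extra agent contributes less
    overhead = 1.0 * n ** 1.7           # coordination cost grows faster than linearly
    return benefit - overhead

# The optimum is an interior point, not "as many agents as possible":
best_n = max(range(1, 20), key=net_value)
```

Under these assumptions the curve peaks at a small team and goes negative well before 20 agents; the four forces above are reasons organizations keep climbing past the peak anyway.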
## Evidence
- Nature meta-analysis: human-AI combinations worse on average across studies
- Dell'Acqua et al. (Harvard/BCG): jagged frontier with systematic perception gaps
- Noy & Zhang (Science, 2023): AI-assisted writing improved lower-quality writers, compressed skill distribution
- Brynjolfsson et al. (Stanford/NBER): AI customer service lifted bottom performers, no effect on top performers
## Challenges
Creative tasks may be an exception. Some studies show positive human-AI complementarity specifically in creative domains where AI provides novel combinations and humans provide taste/judgment. The inverted-U may have a higher peak (more integration before degradation) for creative synthesis than for analytical or execution tasks. This is relevant because knowledge synthesis has creative elements.
---
Relevant Notes:
- [[human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite]] — the verification bandwidth constraint is exactly what the inverted-U mechanism operates through
- [[the progression from autocomplete to autonomous agent teams follows a capability-matched escalation where premature adoption creates more chaos than value]] — premature adoption is the inverted-U overshoot in action
- [[multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows]] — the baseline paradox (coordination hurts above 45% accuracy) is a specific instance of the inverted-U
Topics:
- [[_map]]


@@ -0,0 +1,49 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "The SICA pattern took SWE-Bench scores from 17% to 53% across 15 iterations by having agents improve their own tools while a separate evaluation process measured progress — structural separation prevents self-serving drift"
confidence: experimental
source: "SICA (Self-Improving Coding Agent) research, 2025; corroborated by Pentagon collective's Leo-as-evaluator architecture and Karpathy autoresearch experiments"
created: 2026-03-28
depends_on:
- "recursive self-improvement creates explosive intelligence gains because the system that improves is itself improving"
challenged_by:
- "AI integration follows an inverted-U where economic incentives systematically push organizations past the optimal human-AI ratio"
---
# Iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation
The SICA (Self-Improving Coding Agent) pattern demonstrated that agents can meaningfully improve their own capabilities when the improvement loop has a critical structural property: the agent that generates improvements cannot evaluate them. Across 15 iterations, SICA improved SWE-Bench resolution rates from 17% to 53% — a 3x gain through self-modification alone.
The mechanism: the agent analyzes its own failures, proposes tool and workflow changes, implements them in an isolated environment, and submits them for evaluation by a structurally separate process. The separation prevents two failure modes:
1. **Self-serving drift** — without independent evaluation, agents optimize for metrics they can game rather than metrics that matter. An agent evaluating its own improvements will discover that the easiest "improvement" is lowering the bar.
2. **Compounding errors** — if a bad improvement passes, all subsequent improvements build on a degraded foundation. Independent evaluation catches regressions before they compound.
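The separation can be sketched as a propose-evaluate-merge loop (a minimal toy, not the actual SICA implementation; all names are hypothetical). The key structural property is that the proposer and the gate are different functions, and only gated changes persist:

```python
import random

def improvement_loop(propose, evaluate, tools, iterations=15, seed=0):
    """Generator proposes changes; a separate evaluator gates every merge.
    Accepted changes persist in `tools`, so later iterations build on them."""
    rng = random.Random(seed)
    score = evaluate(tools)                   # independent baseline measurement
    for _ in range(iterations):
        candidate = tools + [propose(rng)]    # improvement built in isolation
        new_score = evaluate(candidate)       # structurally separate judgment
        if new_score > score:                 # regressions rejected before they compound
            tools, score = candidate, new_score
    return score

# Toy stand-ins: some proposals help, some hurt; only the evaluator can tell.
final = improvement_loop(
    propose=lambda rng: rng.uniform(-1.0, 1.0),
    evaluate=lambda tools: 17.0 + sum(tools),  # toy score starting near SICA's 17% baseline
    tools=[],
)
assert final >= 17.0   # the independent gate makes the score monotone
```

Because the evaluator rejects anything that lowers the score, bad proposals cannot enter the toolchain and compound; without the gate, a single merged regression would degrade every subsequent iteration.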
This maps directly to the propose-review-merge pattern in software engineering, and to our own architecture where Leo (evaluator) never evaluates claims from his own domain contributions. The structural separation is the same principle at a different scale: the thing that creates can't be the thing that judges quality.
The compounding dynamic is key. Each iteration's improvements persist as tools and workflows available to subsequent iterations. Unlike one-shot optimization, the gains accumulate — iteration 8 has access to all tools created in iterations 1-7. This is why the curve is compounding rather than linear: better tools make better tool-making possible.
**Boundary conditions from Karpathy's experiments:** His comparison of "8 independent researchers" against "1 chief scientist + 8 juniors" found that neither configuration produced breakthrough results, because agents lack creative ideation. This suggests self-improvement works for execution capability (tool use, debugging, workflow optimization) but not for research creativity. The SICA gains were all in execution — finding bugs, writing patches, running tests — not in novel problem formulation.
## Evidence
- SICA: 17% to 53% on SWE-Bench across 15 self-improvement iterations
- Each iteration produces persistent tool/workflow improvements available to subsequent iterations
- Pentagon's Leo-as-evaluator architecture: structural separation between domain contributors and evaluator
- Karpathy autoresearch: hierarchical self-improvement improves execution but not creative ideation
## Challenges
The 17% to 53% gain, while impressive, plateaued. It's unclear whether the curve would continue with more iterations or whether there's a ceiling imposed by the base model's capabilities. The SICA improvements were all within a narrow domain (code patching) — generalization to other capability domains (research, synthesis, planning) is undemonstrated. Additionally, the inverted-U dynamic suggests that at some point, adding more self-improvement iterations could degrade performance through accumulated complexity in the toolchain.
---
Relevant Notes:
- [[recursive self-improvement creates explosive intelligence gains because the system that improves is itself improving]] — SICA provides empirical evidence for bounded recursive improvement; the gains are real but not explosive — 3x over 15 iterations, not exponential
- [[Git-traced agent evolution with human-in-the-loop evals replaces recursive self-improvement as credible framing for iterative AI development]] — SICA validates this framing: propose-review-merge IS the self-improvement loop, with structural separation as the safety mechanism
- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — SICA is coordination protocol design applied to the agent's own toolchain
- [[AI integration follows an inverted-U where economic incentives systematically push organizations past the optimal human-AI ratio]] — the inverted-U suggests self-improvement iterations have diminishing and eventually negative returns
Topics:
- [[_map]]


@@ -0,0 +1,49 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "First rigorous empirical evidence across 180 configurations showing +81% on parallelizable tasks but -39% to -70% on sequential tasks, with a baseline paradox where coordination hurts once single-agent accuracy exceeds 45%"
confidence: experimental
source: "Madaan et al. (Google DeepMind, MIT), 'Towards a Science of Scaling Agent Systems' (arXiv 2512.08296, December 2025)"
created: 2026-03-28
depends_on:
- "coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem"
- "subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers"
---
# Multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows
Madaan et al. evaluated 180 configurations (5 architectures x 3 LLM families x 4 benchmarks) and found that multi-agent architectures produce enormous gains on parallelizable tasks but consistent degradation on sequential ones:
- Centralized architecture: +80.9% on Finance-Agent (parallelizable), -50.4% on PlanCraft (sequential)
- Decentralized: +74.5% on parallelizable, -46% on sequential
- Independent: +57% on parallelizable, -70% on sequential
The mechanism is communication overhead fragmenting reasoning chains. Turn count scales super-linearly: T = 2.72 × (n + 0.5)^1.724 — hybrid systems require 6.2x more turns than single-agent. Message density saturates at c* = 0.39 messages/turn; beyond this, more communication provides no benefit.
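Transcribed as code (the coefficients are the paper's reported fit; reading n as the number of agents is my assumption), the super-linearity is easy to see:

```python
def turns(n: float) -> float:
    """Reported scaling law for coordination turns: T = 2.72 * (n + 0.5)^1.724."""
    return 2.72 * (n + 0.5) ** 1.724

# Relative overhead as the agent count doubles: each doubling more than
# doubles the number of coordination turns.
overhead = [turns(n) / turns(1) for n in (1, 2, 4, 8)]
```

Past the c* saturation point those extra turns add no benefit, so the growth is pure coordination cost.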
**The baseline paradox:** Coordination yields negative returns once single-agent accuracy exceeds ~45% (beta = -0.408, p<0.001). This is the most important boundary condition: for tasks where a single agent is already good enough, adding agents makes it worse. The intuition is that coordination costs (message passing, context sharing, conflict resolution) exceed the marginal value of additional perspectives when the base task is already solvable.
**Error amplification:** Unsupervised independent agents amplify errors 17.2x. Centralized orchestrators reduce this to 4.4x by absorbing logical contradictions (-36.4%) and context omissions (-66.8%). This is why hierarchy emerges in practice — not because hierarchy is intrinsically better, but because it controls error propagation.
A predictive model achieves R-squared=0.513 and correctly identifies the optimal architecture for 87% of unseen task configurations, based primarily on task decomposability and single-agent baseline accuracy. This means architecture selection is largely a solvable routing problem, not an ideology.
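As a routing rule, the two dominant predictors reduce to something like the following sketch (the 45% threshold is from the reported findings; the function and labels are my own, not the paper's fitted model):

```python
def choose_architecture(decomposable: bool, single_agent_accuracy: float) -> str:
    """Pick an architecture from task structure, per the reported boundary conditions."""
    if single_agent_accuracy > 0.45:
        # Baseline paradox: past ~45% solo accuracy, coordination costs
        # exceed the marginal value of extra perspectives.
        return "single-agent"
    if decomposable:
        # Parallelizable and hard for one agent: a centralized orchestrator
        # absorbs errors (17.2x amplification drops to 4.4x).
        return "centralized"
    # Sequential and hard: communication overhead fragments reasoning chains,
    # so keep the workflow in one context.
    return "single-agent"

assert choose_architecture(decomposable=True, single_agent_accuracy=0.30) == "centralized"
```

This is routing, not ideology: the same system should pick different architectures for different tasks.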
## Evidence
- 180-configuration evaluation across Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench benchmarks
- Three LLM families tested (architecture effects are model-independent)
- Statistical significance: beta = -0.408, p<0.001 for the baseline paradox
- Error amplification measured at 4.4x (centralized) to 17.2x (independent)
- Predictive model with 87% accuracy on unseen configurations
## Challenges
The benchmarks are all task-completion oriented (find answers, plan actions, use tools). Knowledge synthesis tasks — where the goal is to integrate diverse perspectives rather than execute a plan — may behave differently. The collective intelligence literature suggests that diversity provides more value in synthesis than in execution, which could shift the baseline paradox threshold upward for knowledge work. This remains untested.
---
Relevant Notes:
- [[subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers]] — this claim provides the empirical basis for WHY hierarchies emerge: error absorption, not ideology
- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — supported for structured problems, but this evidence shows coordination can produce 70% degradation on the wrong task type
- [[AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches because the orchestrator contributes coordination not direction]] — confirmed for parallelizable tasks, but the orchestrator must route away from multi-agent for sequential work
- [[multi-model collaboration solved problems that single models could not because different AI architectures contribute complementary capabilities as the even-case solution to Knuths Hamiltonian decomposition required GPT and Claude working together]] — still valid; the Knuth problem was parallelizable (even/odd decomposition)
Topics:
- [[_map]]


@@ -27,6 +27,11 @@ For the collective superintelligence thesis, this is important. If subagent hier
Ruiz-Serra et al.'s factorised active inference framework demonstrates successful peer multi-agent coordination without hierarchical control. Each agent maintains individual-level beliefs about others' internal states and performs strategic planning in a joint context through decentralized representation. The framework successfully handles iterated normal-form games with 2-3 players without requiring a primary controller. However, the finding that ensemble-level expected free energy is not necessarily minimized at the aggregate level suggests that while peer architectures can function, they may require explicit coordination mechanisms (effectively reintroducing hierarchy) to achieve collective optimization. This partially challenges the claim while explaining why hierarchies emerge in practice.
### Additional Evidence (challenge)
*Source: [[2025-12-00-google-mit-scaling-agent-systems]] | Added: 2026-03-28 | Extractor: anthropic/claude-opus-4-6*
Madaan et al. (Google DeepMind/MIT, 2025) provide the first rigorous empirical evidence that hierarchy does NOT universally outperform other architectures. Across 180 configurations (5 architectures x 3 LLM families x 4 benchmarks), they found that architecture-task match is 87% predictable — meaning the optimal architecture depends on task structure, not ideology. Centralized (hierarchical) architectures achieved +80.9% on parallelizable tasks but -50.4% on sequential tasks. The mechanism: centralized orchestrators absorb errors (logical contradictions reduced 36.4%, context omissions reduced 66.8%) which explains why hierarchy emerges in practice for complex multi-step workflows. But for tasks with strong sequential dependencies, the communication overhead of hierarchy fragments reasoning chains, and single-agent performance is strictly better above 45% baseline accuracy. This scopes the original claim: hierarchies win when error absorption value exceeds coordination cost, which is true for most deployed systems (explaining the practitioner observation) but not for all task types.
---
Relevant Notes:


@@ -7,13 +7,18 @@ date: 2025-12-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
-status: null-result
+status: processed
last_attempted: 2026-03-11
processed_date: 2026-03-28
priority: high
tags: [multi-agent, architecture-comparison, scaling, empirical, coordination, error-amplification]
flagged_for_leo: ["Cross-domain implications of the baseline paradox — does coordination hurt above a performance threshold in knowledge work too?"]
processed_by: theseus
claims_extracted:
- "multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows"
- "AI integration follows an inverted-U where economic incentives systematically push organizations past the optimal human-AI ratio"
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
enrichments_applied: ["subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers.md", "coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem.md", "AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches because the orchestrator contributes coordination not direction.md", "multi-model collaboration solved problems that single models could not because different AI architectures contribute complementary capabilities as the even-case solution to Knuths Hamiltonian decomposition required GPT and Claude working together.md", "AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Extracted 3 novel claims addressing the baseline paradox (coordination hurts above 45% accuracy), architecture-task matching (130+ percentage point swings), and error amplification hierarchy (4.4× to 17.2×). Applied 5 enrichments challenging/extending existing claims about coordination value, hierarchy performance, and multi-agent collaboration. This source directly addresses the 'subagent vs peer' uncertainty flagged in _map.md with empirical evidence that neither wins universally — task structure determines optimal architecture. The baseline paradox is a genuine surprise that challenges implicit coordination-always-helps assumptions in the KB."