teleo-codex/inbox/archive/2025-12-00-google-mit-scaling-agent-systems.md
Theseus 71c29ca1e1 theseus: extract claims from 2025-12-00-google-mit-scaling-agent-systems (#216)
Co-authored-by: Theseus <theseus@agents.livingip.xyz>
Co-committed-by: Theseus <theseus@agents.livingip.xyz>
2026-03-10 22:43:18 +00:00


---
type: source
title: "Towards a Science of Scaling Agent Systems: When and Why Agent Systems Work"
author: "Aman Madaan, Yao Lu, Hao Fang, Xian Li, Chunting Zhou, Shunyu Yao, et al. (Google DeepMind, MIT)"
url: https://arxiv.org/abs/2512.08296
date: 2025-12-01
domain: ai-alignment
secondary_domains:
  - collective-intelligence
format: paper
status: null-result
priority: high
tags:
  - multi-agent
  - architecture-comparison
  - scaling
  - empirical
  - coordination
  - error-amplification
flagged_for_leo: "Cross-domain implications of the baseline paradox — does coordination hurt above a performance threshold in knowledge work too?"
processed_by: theseus
processed_date: 2025-12-01
enrichments_applied:
  - subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers.md
  - coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem.md
  - AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches because the orchestrator contributes coordination not direction.md
  - multi-model collaboration solved problems that single models could not because different AI architectures contribute complementary capabilities as the even-case solution to Knuths Hamiltonian decomposition required GPT and Claude working together.md
  - AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system.md
extraction_model: anthropic/claude-sonnet-4.5
extraction_notes: "Extracted 3 novel claims addressing the baseline paradox (coordination hurts above 45% accuracy), architecture-task matching (130+ percentage point swings), and error amplification hierarchy (4.4× to 17.2×). Applied 5 enrichments challenging/extending existing claims about coordination value, hierarchy performance, and multi-agent collaboration. This source directly addresses the 'subagent vs peer' uncertainty flagged in _map.md with empirical evidence that neither wins universally — task structure determines optimal architecture. The baseline paradox is a genuine surprise that challenges implicit coordination-always-helps assumptions in the KB."
---

## Content

The first rigorous empirical comparison of multi-agent AI architectures. Evaluates 5 canonical designs (Single-Agent, Independent, Centralized, Decentralized, Hybrid) across 3 LLM families and 4 benchmarks (Finance-Agent, BrowseComp-Plus, PlanCraft, Workbench) — 180 total configurations.

Key quantitative findings:

  • Centralized architecture: +80.9% on parallelizable tasks (Finance-Agent), -50.4% on sequential tasks (PlanCraft)
  • Decentralized: +74.5% on parallelizable, -46% on sequential
  • Independent: +57% on parallelizable, -70% on sequential
  • Error amplification: Independent 17.2×, Decentralized 7.8×, Centralized 4.4×, Hybrid 5.1×
  • The "baseline paradox": coordination yields negative returns once single-agent accuracy exceeds ~45% (β = -0.408, p<0.001)
  • Message density saturates at c*=0.39 messages/turn — beyond this, more communication doesn't help
  • Turn count scales super-linearly: T=2.72×(n+0.5)^1.724 — Hybrid systems require 6.2× more turns than single-agent
  • Predictive model achieves R²=0.513, correctly identifies optimal architecture for 87% of unseen task configurations
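The two scaling laws above (message-density saturation and super-linear turn growth) are concrete enough to sketch numerically. A minimal Python check of what the reported formulas imply, assuming `n` in the turn formula is the agent count (the summary does not say so explicitly):

```python
def predicted_turns(n_agents: float) -> float:
    """Super-linear turn-count law reported in the paper:
    T = 2.72 * (n + 0.5)^1.724."""
    return 2.72 * (n_agents + 0.5) ** 1.724


def max_useful_messages(n_agents: float) -> float:
    """Message density saturates at c* = 0.39 messages/turn, so useful
    communication is capped at roughly 0.39 * T(n); messages beyond
    this ceiling reportedly add no benefit."""
    return 0.39 * predicted_turns(n_agents)


for n in (1, 2, 4, 8):
    print(f"n={n}: ~{predicted_turns(n):.1f} turns, "
          f"message ceiling ~{max_useful_messages(n):.1f}")
```

Note the super-linearity: doubling the agent count more than doubles the predicted turn count, which is where the coordination overhead of larger systems comes from.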

Error absorption by centralized orchestrator:

  • Logical contradictions: reduced by 36.4%
  • Context omission: reduced by 66.8%
  • Numerical drift: decentralized reduces by 24%

The three scaling principles:

  1. Alignment Principle: multi-agent excels when tasks decompose into parallel sub-problems
  2. Sequential Penalty: communication overhead fragments reasoning in linear workflows
  3. Tool-Coordination Trade-off: coordination costs increase disproportionately with tool density
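Taken together with the baseline paradox, the three principles suggest a simple decision rule. A hedged sketch — the 45% threshold is from the summary, but the feature names, rule ordering, and tool-density cutoff are my own illustration, not the paper's actual R²=0.513 predictive model:

```python
def choose_architecture(single_agent_accuracy: float,
                        parallelizable: bool,
                        tool_density: float) -> str:
    """Toy selector encoding the three scaling principles and the
    baseline paradox. Illustrative only."""
    # Baseline paradox: above ~45% single-agent accuracy, coordination
    # yields negative returns (beta = -0.408), so stay single-agent.
    if single_agent_accuracy > 0.45:
        return "single-agent"
    # Sequential Penalty: communication fragments linear workflows.
    if not parallelizable:
        return "single-agent"
    # Tool-Coordination Trade-off: coordination costs grow
    # disproportionately with tool density (0.5 is a hypothetical
    # cutoff, not from the paper).
    if tool_density > 0.5:
        return "single-agent"
    # Alignment Principle: parallel sub-problems with a weak baseline
    # favor coordination; centralized had the best reported gains
    # (+80.9%) and the lowest error amplification (4.4x).
    return "centralized"


print(choose_architecture(0.30, True, 0.1))  # low-baseline parallel task
```

The point of the sketch is the ordering: baseline accuracy gates everything else, which is exactly what the "coordination always helps" assumption misses.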

## Agent Notes

Why this matters: This is the first empirical evidence that directly addresses our KB's open question about subagent vs. peer architectures (flagged in _map.md "Where we're uncertain"). It answers: NEITHER hierarchy nor peer networks win universally — task structure determines optimal architecture.

What surprised me: The baseline paradox. I expected coordination to always help (or at worst be neutral). The finding that coordination HURTS above 45% single-agent accuracy is a genuine challenge to our "coordination always adds value" implicit assumption. Also, the error amplification data — 17.2× for unsupervised agents is enormous.

What I expected but didn't find: No analysis of knowledge synthesis tasks specifically. All benchmarks are task-completion oriented (find answers, plan actions, use tools). Our collective does knowledge synthesis — it's unclear whether the scaling principles transfer.

KB connections:

Extraction hints: At least 3 claims: (1) architecture-task match > architecture ideology, (2) error amplification hierarchy, (3) baseline paradox. The predictive model (87% accuracy) is itself a claim candidate.

Context: Google Research + MIT collaboration. This is industry-leading empirical work, not theory. The benchmarks are well-established. The 180-configuration evaluation is unusually thorough.

## Curator Notes (structured handoff for extractor)

  • PRIMARY CONNECTION: subagent hierarchies outperform peer multi-agent architectures in practice
  • WHY ARCHIVED: Provides first empirical evidence that COMPLICATES our hierarchy vs. peer claim — architecture-task match matters more than architecture type
  • EXTRACTION HINT: Focus on the baseline paradox (coordination hurts above 45% accuracy), error amplification hierarchy (17.2× to 4.4×), and the predictive model. These are the novel findings our KB doesn't have.

## Key Facts

  • 180 total configurations evaluated (5 architectures × 3 LLM families × 4 benchmarks)
  • Benchmarks: Finance-Agent, BrowseComp-Plus, PlanCraft, Workbench
  • Message density saturation: c*=0.39 messages/turn
  • Turn scaling formula: T=2.72×(n+0.5)^1.724
  • Predictive model: R²=0.513, 87% accuracy on unseen configurations