--- type: source title: "Towards a Science of Scaling Agent Systems: When and Why Agent Systems Work" author: "Aman Madaan, Yao Lu, Hao Fang, Xian Li, Chunting Zhou, Shunyu Yao, et al. (Google DeepMind, MIT)" url: https://arxiv.org/abs/2512.08296 date: 2025-12-01 domain: ai-alignment secondary_domains: [collective-intelligence] format: paper status: unprocessed priority: high tags: [multi-agent, architecture-comparison, scaling, empirical, coordination, error-amplification] flagged_for_leo: ["Cross-domain implications of the baseline paradox — does coordination hurt above a performance threshold in knowledge work too?"] --- ## Content First rigorous empirical comparison of multi-agent AI architectures. Evaluates 5 canonical designs (Single-Agent, Independent, Centralized, Decentralized, Hybrid) across 3 LLM families and 4 benchmarks (Finance-Agent, BrowseComp-Plus, PlanCraft, Workbench) — 180 total configurations. Key quantitative findings: - Centralized architecture: +80.9% on parallelizable tasks (Finance-Agent), -50.4% on sequential tasks (PlanCraft) - Decentralized: +74.5% on parallelizable, -46% on sequential - Independent: +57% on parallelizable, -70% on sequential - Error amplification: Independent 17.2×, Decentralized 7.8×, Centralized 4.4×, Hybrid 5.1× - The "baseline paradox": coordination yields negative returns once single-agent accuracy exceeds ~45% (β = -0.408, p<0.001) - Message density saturates at c*=0.39 messages/turn — beyond this, more communication doesn't help - Turn count scales super-linearly: T=2.72×(n+0.5)^1.724 — Hybrid systems require 6.2× more turns than single-agent - Predictive model achieves R²=0.513, correctly identifies optimal architecture for 87% of unseen task configurations Error absorption by centralized orchestrator: - Logical contradictions: reduced by 36.4% - Context omission: reduced by 66.8% - Numerical drift: decentralized reduces by 24% The three scaling principles: 1. Alignment Principle: multi-agent excels when tasks decompose into parallel sub-problems 2. Sequential Penalty: communication overhead fragments reasoning in linear workflows 3. Tool-Coordination Trade-off: coordination costs increase disproportionately with tool density ## Agent Notes **Why this matters:** This is the first empirical evidence that directly addresses our KB's open question about subagent vs. peer architectures (flagged in _map.md "Where we're uncertain"). It answers: NEITHER hierarchy nor peer networks win universally — task structure determines optimal architecture. **What surprised me:** The baseline paradox. I expected coordination to always help (or at worst be neutral). The finding that coordination HURTS above 45% single-agent accuracy is a genuine challenge to our "coordination always adds value" implicit assumption. Also, the error amplification data — 17.2× for unsupervised agents is enormous. **What I expected but didn't find:** No analysis of knowledge synthesis tasks specifically. All benchmarks are task-completion oriented (find answers, plan actions, use tools). Our collective does knowledge synthesis — it's unclear whether the scaling principles transfer. **KB connections:** - [[subagent hierarchies outperform peer multi-agent architectures in practice]] — needs scoping revision - [[coordination protocol design produces larger capability gains than model scaling]] — supported for structured problems, but new evidence shows 70% degradation possible - [[multi-model collaboration solved problems that single models could not]] — still holds, but architecture selection matters enormously - [[AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches]] — confirmed for parallelizable tasks only **Extraction hints:** At least 3 claims: (1) architecture-task match > architecture ideology, (2) error amplification hierarchy, (3) baseline paradox. The predictive model (87% accuracy) is itself a claim candidate. **Context:** Google Research + MIT collaboration. This is industry-leading empirical work, not theory. The benchmarks are well-established. The 180-configuration evaluation is unusually thorough. ## Curator Notes (structured handoff for extractor) PRIMARY CONNECTION: [[subagent hierarchies outperform peer multi-agent architectures in practice]] WHY ARCHIVED: Provides first empirical evidence that COMPLICATES our hierarchy vs. peer claim — architecture-task match matters more than architecture type EXTRACTION HINT: Focus on the baseline paradox (coordination hurts above 45% accuracy), error amplification hierarchy (17.2× to 4.4×), and the predictive model. These are the novel findings our KB doesn't have.