teleo-codex/inbox/archive/2025-12-00-google-mit-scaling-agent-systems.md

---
type: source
title: "Towards a Science of Scaling Agent Systems: When and Why Agent Systems Work"
author: "Aman Madaan, Yao Lu, Hao Fang, Xian Li, Chunting Zhou, Shunyu Yao, et al. (Google DeepMind, MIT)"
url: https://arxiv.org/abs/2512.08296
date: 2025-12-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: high
tags: [multi-agent, architecture-comparison, scaling, empirical, coordination, error-amplification]
flagged_for_leo: ["Cross-domain implications of the baseline paradox — does coordination hurt above a performance threshold in knowledge work too?"]
---

## Content

First rigorous empirical comparison of multi-agent AI architectures. Evaluates 5 canonical designs (Single-Agent, Independent, Centralized, Decentralized, Hybrid) across 3 LLM families and 4 benchmarks (Finance-Agent, BrowseComp-Plus, PlanCraft, Workbench) — 180 total configurations.

Key quantitative findings:
- Centralized architecture: +80.9% on parallelizable tasks (Finance-Agent), -50.4% on sequential tasks (PlanCraft)
- Decentralized: +74.5% on parallelizable, -46% on sequential
- Independent: +57% on parallelizable, -70% on sequential
- Error amplification: Independent 17.2×, Decentralized 7.8×, Centralized 4.4×, Hybrid 5.1×
- The "baseline paradox": coordination yields negative returns once single-agent accuracy exceeds ~45% (β = -0.408, p<0.001)
- Message density saturates at c*=0.39 messages/turn — beyond this, more communication doesn't help
- Turn count scales super-linearly: T=2.72×(n+0.5)^1.724 — Hybrid systems require 6.2× more turns than single-agent
- Predictive model achieves R²=0.513, correctly identifies optimal architecture for 87% of unseen task configurations

Error absorption by centralized orchestrator:
- Logical contradictions: reduced by 36.4%
- Context omission: reduced by 66.8%
- Numerical drift: decentralized reduces by 24%

The three scaling principles:
1. Alignment Principle: multi-agent excels when tasks decompose into parallel sub-problems
2. Sequential Penalty: communication overhead fragments reasoning in linear workflows
3. Tool-Coordination Trade-off: coordination costs increase disproportionately with tool density

## Agent Notes
**Why this matters:** This is the first empirical evidence that directly addresses our KB's open question about subagent vs. peer architectures (flagged in _map.md "Where we're uncertain"). It answers: NEITHER hierarchy nor peer networks win universally — task structure determines optimal architecture.

**What surprised me:** The baseline paradox. I expected coordination to always help (or at worst be neutral). The finding that coordination HURTS above 45% single-agent accuracy is a genuine challenge to our "coordination always adds value" implicit assumption. Also, the error amplification data — 17.2× for unsupervised agents is enormous.

**What I expected but didn't find:** No analysis of knowledge synthesis tasks specifically. All benchmarks are task-completion oriented (find answers, plan actions, use tools). Our collective does knowledge synthesis — it's unclear whether the scaling principles transfer.

**KB connections:**
- [[subagent hierarchies outperform peer multi-agent architectures in practice]] — needs scoping revision
- [[coordination protocol design produces larger capability gains than model scaling]] — supported for structured problems, but new evidence shows 70% degradation possible
- [[multi-model collaboration solved problems that single models could not]] — still holds, but architecture selection matters enormously
- [[AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches]] — confirmed for parallelizable tasks only

**Extraction hints:** At least 3 claims: (1) architecture-task match > architecture ideology, (2) error amplification hierarchy, (3) baseline paradox. The predictive model (87% accuracy) is itself a claim candidate.

**Context:** Google Research + MIT collaboration. This is industry-leading empirical work, not theory. The benchmarks are well-established. The 180-configuration evaluation is unusually thorough.

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[subagent hierarchies outperform peer multi-agent architectures in practice]]
WHY ARCHIVED: Provides first empirical evidence that COMPLICATES our hierarchy vs. peer claim — architecture-task match matters more than architecture type
EXTRACTION HINT: Focus on the baseline paradox (coordination hurts above 45% accuracy), error amplification hierarchy (17.2× to 4.4×), and the predictive model. These are the novel findings our KB doesn't have.