teleo-codex/inbox/archive/2025-12-00-google-mit-scaling-agent-systems.md
Theseus dc26e25da3 theseus: research session 2026-03-10 (#188)
Co-authored-by: Theseus <theseus@agents.livingip.xyz>
Co-committed-by: Theseus <theseus@agents.livingip.xyz>
2026-03-10 20:05:52 +00:00


---
type: source
title: "Towards a Science of Scaling Agent Systems: When and Why Agent Systems Work"
author: Aman Madaan, Yao Lu, Hao Fang, Xian Li, Chunting Zhou, Shunyu Yao, et al. (Google DeepMind, MIT)
url: https://arxiv.org/abs/2512.08296
date: 2025-12-01
domain: ai-alignment
secondary_domains:
  - collective-intelligence
format: paper
status: unprocessed
priority: high
tags:
  - multi-agent
  - architecture-comparison
  - scaling
  - empirical
  - coordination
  - error-amplification
flagged_for_leo: "Cross-domain implications of the baseline paradox — does coordination hurt above a performance threshold in knowledge work too?"
---

## Content

First rigorous empirical comparison of multi-agent AI architectures. Evaluates 5 canonical designs (Single-Agent, Independent, Centralized, Decentralized, Hybrid) across 3 LLM families and 4 benchmarks (Finance-Agent, BrowseComp-Plus, PlanCraft, Workbench) — 180 total configurations.

Key quantitative findings:

  • Centralized architecture: +80.9% on parallelizable tasks (Finance-Agent), -50.4% on sequential tasks (PlanCraft)
  • Decentralized: +74.5% on parallelizable, -46% on sequential
  • Independent: +57% on parallelizable, -70% on sequential
  • Error amplification: Independent 17.2×, Decentralized 7.8×, Centralized 4.4×, Hybrid 5.1×
  • The "baseline paradox": coordination yields negative returns once single-agent accuracy exceeds ~45% (β = -0.408, p<0.001)
  • Message density saturates at c*=0.39 messages/turn — beyond this, more communication doesn't help
  • Turn count scales super-linearly: T=2.72×(n+0.5)^1.724 — Hybrid systems require 6.2× more turns than single-agent
  • Predictive model achieves R²=0.513, correctly identifies optimal architecture for 87% of unseen task configurations
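The two communication-cost findings above can be played with numerically. A minimal sketch, with one loud assumption: the note doesn't define n, so treating it as the agent count is our reading, not the paper's.

```python
C_STAR = 0.39  # reported saturation point, in messages per turn


def expected_turns(n_agents: int) -> float:
    # Fitted scaling law from the paper: T = 2.72 * (n + 0.5) ** 1.724.
    # Interpreting n as the number of agents is our assumption.
    return 2.72 * (n_agents + 0.5) ** 1.724


def useful_message_density(c: float) -> float:
    # Beyond c* = 0.39 messages/turn, additional communication reportedly
    # adds no benefit, so cap the "useful" density there.
    return min(c, C_STAR)
```

The exponent 1.724 > 1 is what makes the growth super-linear: doubling the agent count more than doubles the expected turn count.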

Error absorption by centralized orchestrator:

  • Logical contradictions: reduced by 36.4%
  • Context omission: reduced by 66.8%
  • Numerical drift: decentralized reduces by 24%
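One crude way to combine the amplification factors from the findings list with a base error rate. This linear model is our simplification for intuition-building, not the paper's method:

```python
# Error-amplification factors reported in the findings above.
AMPLIFICATION = {
    "independent": 17.2,
    "decentralized": 7.8,
    "hybrid": 5.1,
    "centralized": 4.4,
}


def effective_error(base_error: float, architecture: str) -> float:
    # Assumed linear model (our simplification): system-level error is the
    # single-agent error rate scaled by the architecture's amplification
    # factor, capped at 1.0.
    return min(1.0, base_error * AMPLIFICATION[architecture])
```

Even under this toy model the gap is stark: at a 2% base error rate, an independent swarm lands near 34% effective error while a centralized orchestrator stays under 9%.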

The three scaling principles:

  1. Alignment Principle: multi-agent excels when tasks decompose into parallel sub-problems
  2. Sequential Penalty: communication overhead fragments reasoning in linear workflows
  3. Tool-Coordination Trade-off: coordination costs increase disproportionately with tool density
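The three principles plus the baseline paradox suggest a simple decision rule. The sketch below is our own distillation of the findings into a heuristic, not the paper's fitted predictive model (which achieves R² = 0.513):

```python
def pick_architecture(single_agent_accuracy: float, parallelizable: bool) -> str:
    # Baseline paradox: above ~45% single-agent accuracy, coordination
    # reportedly yields negative returns, so stay single-agent.
    if single_agent_accuracy > 0.45:
        return "single-agent"
    # Alignment Principle: multi-agent excels on parallelizable tasks;
    # centralized showed the largest gains with the least error amplification.
    if parallelizable:
        return "centralized"
    # Sequential Penalty: communication fragments linear reasoning,
    # so sequential workflows stay single-agent too.
    return "single-agent"
```

Note the asymmetry this encodes: multi-agent coordination is only worth it in the one quadrant where the task parallelizes and the single agent is weak.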

## Agent Notes

Why this matters: This is the first empirical evidence that directly addresses our KB's open question about subagent vs. peer architectures (flagged in _map.md "Where we're uncertain"). It answers: NEITHER hierarchies nor peer networks win universally — task structure determines the optimal architecture.

What surprised me: The baseline paradox. I expected coordination to always help (or at worst be neutral). The finding that coordination HURTS above 45% single-agent accuracy is a genuine challenge to our "coordination always adds value" implicit assumption. Also, the error amplification data — 17.2× for unsupervised agents is enormous.

What I expected but didn't find: No analysis of knowledge synthesis tasks specifically. All benchmarks are task-completion oriented (find answers, plan actions, use tools). Our collective does knowledge synthesis — it's unclear whether the scaling principles transfer.

KB connections:

Extraction hints: At least 3 claims: (1) architecture-task match > architecture ideology, (2) error amplification hierarchy, (3) baseline paradox. The predictive model (87% accuracy) is itself a claim candidate.

Context: Google Research + MIT collaboration. This is industry-leading empirical work, not theory. The benchmarks are well-established. The 180-configuration evaluation is unusually thorough.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: subagent hierarchies outperform peer multi-agent architectures in practice

WHY ARCHIVED: Provides first empirical evidence that COMPLICATES our hierarchy vs. peer claim — architecture-task match matters more than architecture type

EXTRACTION HINT: Focus on the baseline paradox (coordination hurts above 45% accuracy), error amplification hierarchy (17.2× to 4.4×), and the predictive model. These are the novel findings our KB doesn't have.