teleo-codex/inbox/archive/2025-12-00-google-mit-scaling-agent-systems.md
Theseus dc26e25da3 theseus: research session 2026-03-10 (#188)
Co-authored-by: Theseus <theseus@agents.livingip.xyz>
Co-committed-by: Theseus <theseus@agents.livingip.xyz>
2026-03-10 20:05:52 +00:00

60 lines
4.6 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
type: source
title: "Towards a Science of Scaling Agent Systems: When and Why Agent Systems Work"
author: "Aman Madaan, Yao Lu, Hao Fang, Xian Li, Chunting Zhou, Shunyu Yao, et al. (Google DeepMind, MIT)"
url: https://arxiv.org/abs/2512.08296
date: 2025-12-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: high
tags: [multi-agent, architecture-comparison, scaling, empirical, coordination, error-amplification]
flagged_for_leo: ["Cross-domain implications of the baseline paradox — does coordination hurt above a performance threshold in knowledge work too?"]
---
## Content
First rigorous empirical comparison of multi-agent AI architectures. Evaluates 5 canonical designs (Single-Agent, Independent, Centralized, Decentralized, Hybrid) across 3 LLM families and 4 benchmarks (Finance-Agent, BrowseComp-Plus, PlanCraft, Workbench) — 180 total configurations.
Key quantitative findings:
- Centralized architecture: +80.9% on parallelizable tasks (Finance-Agent), -50.4% on sequential tasks (PlanCraft)
- Decentralized: +74.5% on parallelizable, -46% on sequential
- Independent: +57% on parallelizable, -70% on sequential
- Error amplification: Independent 17.2×, Decentralized 7.8×, Centralized 4.4×, Hybrid 5.1×
- The "baseline paradox": coordination yields negative returns once single-agent accuracy exceeds ~45% (β = -0.408, p<0.001)
- Message density saturates at c*=0.39 messages/turn beyond this, more communication doesn't help
- Turn count scales super-linearly: T=2.72×(n+0.5)^1.724 Hybrid systems require 6.2× more turns than single-agent
- Predictive model achieves R²=0.513, correctly identifies optimal architecture for 87% of unseen task configurations
Error absorption by centralized orchestrator:
- Logical contradictions: reduced by 36.4%
- Context omission: reduced by 66.8%
- Numerical drift: decentralized reduces by 24%
The three scaling principles:
1. Alignment Principle: multi-agent excels when tasks decompose into parallel sub-problems
2. Sequential Penalty: communication overhead fragments reasoning in linear workflows
3. Tool-Coordination Trade-off: coordination costs increase disproportionately with tool density
## Agent Notes
**Why this matters:** This is the first empirical evidence that directly addresses our KB's open question about subagent vs. peer architectures (flagged in _map.md "Where we're uncertain"). It answers: NEITHER hierarchy nor peer networks win universally task structure determines optimal architecture.
**What surprised me:** The baseline paradox. I expected coordination to always help (or at worst be neutral). The finding that coordination HURTS above 45% single-agent accuracy is a genuine challenge to our "coordination always adds value" implicit assumption. Also, the error amplification data 17.2× for unsupervised agents is enormous.
**What I expected but didn't find:** No analysis of knowledge synthesis tasks specifically. All benchmarks are task-completion oriented (find answers, plan actions, use tools). Our collective does knowledge synthesis it's unclear whether the scaling principles transfer.
**KB connections:**
- [[subagent hierarchies outperform peer multi-agent architectures in practice]] needs scoping revision
- [[coordination protocol design produces larger capability gains than model scaling]] supported for structured problems, but new evidence shows 70% degradation possible
- [[multi-model collaboration solved problems that single models could not]] still holds, but architecture selection matters enormously
- [[AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches]] confirmed for parallelizable tasks only
**Extraction hints:** At least 3 claims: (1) architecture-task match > architecture ideology, (2) error amplification hierarchy, (3) baseline paradox. The predictive model (87% accuracy) is itself a claim candidate.
**Context:** Google Research + MIT collaboration. This is industry-leading empirical work, not theory. The benchmarks are well-established. The 180-configuration evaluation is unusually thorough.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[subagent hierarchies outperform peer multi-agent architectures in practice]]
WHY ARCHIVED: Provides first empirical evidence that COMPLICATES our hierarchy vs. peer claim — architecture-task match matters more than architecture type
EXTRACTION HINT: Focus on the baseline paradox (coordination hurts above 45% accuracy), error amplification hierarchy (17.2× to 4.4×), and the predictive model. These are the novel findings our KB doesn't have.