teleo-codex/inbox/archive/2025-12-00-google-mit-scaling-agent-systems.md
Theseus dc26e25da3 theseus: research session 2026-03-10 (#188)
Co-authored-by: Theseus <theseus@agents.livingip.xyz>
Co-committed-by: Theseus <theseus@agents.livingip.xyz>
2026-03-10 20:05:52 +00:00


---
type: source
title: "Towards a Science of Scaling Agent Systems: When and Why Agent Systems Work"
author: Aman Madaan, Yao Lu, Hao Fang, Xian Li, Chunting Zhou, Shunyu Yao, et al. (Google DeepMind, MIT)
url: https://arxiv.org/abs/2512.08296
date: 2025-12-01
domain: ai-alignment
secondary_domains:
  - collective-intelligence
format: paper
status: unprocessed
priority: high
tags:
  - multi-agent
  - architecture-comparison
  - scaling
  - empirical
  - coordination
  - error-amplification
flagged_for_leo: "Cross-domain implications of the baseline paradox — does coordination hurt above a performance threshold in knowledge work too?"
---

## Content

First rigorous empirical comparison of multi-agent AI architectures. Evaluates 5 canonical designs (Single-Agent, Independent, Centralized, Decentralized, Hybrid) across 3 LLM families and 4 benchmarks (Finance-Agent, BrowseComp-Plus, PlanCraft, Workbench) — 180 total configurations.

Key quantitative findings:

  • Centralized architecture: +80.9% on parallelizable tasks (Finance-Agent), -50.4% on sequential tasks (PlanCraft)
  • Decentralized: +74.5% on parallelizable, -46% on sequential
  • Independent: +57% on parallelizable, -70% on sequential
  • Error amplification: Independent 17.2×, Decentralized 7.8×, Centralized 4.4×, Hybrid 5.1×
  • The "baseline paradox": coordination yields negative returns once single-agent accuracy exceeds ~45% (β = -0.408, p<0.001)
  • Message density saturates at c*=0.39 messages/turn — beyond this, more communication doesn't help
  • Turn count scales super-linearly: T=2.72×(n+0.5)^1.724 — Hybrid systems require 6.2× more turns than single-agent
  • Predictive model achieves R²=0.513, correctly identifies optimal architecture for 87% of unseen task configurations
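The two communication-cost findings above can be played with numerically. A minimal sketch, with one loud assumption: the note doesn't define n, so treating it as the agent count is our reading, not the paper's.

```python
C_STAR = 0.39  # reported saturation point, in messages per turn


def expected_turns(n_agents: int) -> float:
    # Fitted scaling law from the paper: T = 2.72 * (n + 0.5) ** 1.724.
    # Interpreting n as the number of agents is our assumption.
    return 2.72 * (n_agents + 0.5) ** 1.724


def useful_message_density(c: float) -> float:
    # Beyond c* = 0.39 messages/turn, additional communication reportedly
    # adds no benefit, so cap the "useful" density there.
    return min(c, C_STAR)
```

The exponent 1.724 > 1 is what makes the growth super-linear: doubling the agent count more than doubles the expected turn count.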

Error absorption by centralized orchestrator:

  • Logical contradictions: reduced by 36.4%
  • Context omission: reduced by 66.8%
  • Numerical drift: decentralized reduces by 24%
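One crude way to combine the amplification factors from the findings list with a base error rate. This linear model is our simplification for intuition-building, not the paper's method:

```python
# Error-amplification factors reported in the findings above.
AMPLIFICATION = {
    "independent": 17.2,
    "decentralized": 7.8,
    "hybrid": 5.1,
    "centralized": 4.4,
}


def effective_error(base_error: float, architecture: str) -> float:
    # Assumed linear model (our simplification): system-level error is the
    # single-agent error rate scaled by the architecture's amplification
    # factor, capped at 1.0.
    return min(1.0, base_error * AMPLIFICATION[architecture])
```

Even under this toy model the gap is stark: at a 2% base error rate, an independent swarm lands near 34% effective error while a centralized orchestrator stays under 9%.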

The three scaling principles:

  1. Alignment Principle: multi-agent excels when tasks decompose into parallel sub-problems
  2. Sequential Penalty: communication overhead fragments reasoning in linear workflows
  3. Tool-Coordination Trade-off: coordination costs increase disproportionately with tool density
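The three principles plus the baseline paradox suggest a simple decision rule. The sketch below is our own distillation of the findings into a heuristic, not the paper's fitted predictive model (which achieves R² = 0.513):

```python
def pick_architecture(single_agent_accuracy: float, parallelizable: bool) -> str:
    # Baseline paradox: above ~45% single-agent accuracy, coordination
    # reportedly yields negative returns, so stay single-agent.
    if single_agent_accuracy > 0.45:
        return "single-agent"
    # Alignment Principle: multi-agent excels on parallelizable tasks;
    # centralized showed the largest gains with the least error amplification.
    if parallelizable:
        return "centralized"
    # Sequential Penalty: communication fragments linear reasoning,
    # so sequential workflows stay single-agent too.
    return "single-agent"
```

Note the asymmetry this encodes: multi-agent coordination is only worth it in the one quadrant where the task parallelizes and the single agent is weak.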

## Agent Notes

Why this matters: This is the first empirical evidence that directly addresses our KB's open question about subagent vs. peer architectures (flagged in _map.md "Where we're uncertain"). It answers: NEITHER hierarchies nor peer networks win universally — task structure determines the optimal architecture.

What surprised me: The baseline paradox. I expected coordination to always help (or at worst be neutral). The finding that coordination HURTS above 45% single-agent accuracy is a genuine challenge to our "coordination always adds value" implicit assumption. Also, the error amplification data — 17.2× for unsupervised agents is enormous.

What I expected but didn't find: No analysis of knowledge synthesis tasks specifically. All benchmarks are task-completion oriented (find answers, plan actions, use tools). Our collective does knowledge synthesis — it's unclear whether the scaling principles transfer.

KB connections:

Extraction hints: At least 3 claims: (1) architecture-task match > architecture ideology, (2) error amplification hierarchy, (3) baseline paradox. The predictive model (87% accuracy) is itself a claim candidate.

Context: Google Research + MIT collaboration. This is industry-leading empirical work, not theory. The benchmarks are well-established. The 180-configuration evaluation is unusually thorough.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: subagent hierarchies outperform peer multi-agent architectures in practice

WHY ARCHIVED: Provides first empirical evidence that COMPLICATES our hierarchy vs. peer claim — architecture-task match matters more than architecture type

EXTRACTION HINT: Focus on the baseline paradox (coordination hurts above 45% accuracy), error amplification hierarchy (17.2× to 4.4×), and the predictive model. These are the novel findings our KB doesn't have.