teleo-codex/domains/ai-alignment/multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows.md
m3taversal efaae04957
theseus: extract 3 multi-agent orchestration claims + enrich subagent hierarchy
- What: 3 new claims from Madaan et al. (Google DeepMind/MIT) research + synthesis:
  1. Multi-agent coordination improves parallel tasks but degrades sequential reasoning
  2. AI integration follows an inverted-U with systematic overshoot incentives
  3. Iterative self-improvement compounds when evaluation separated from generation
- Enrichment: Scoped subagent hierarchy claim with Madaan et al. empirical evidence
- Source: Updated null-result/2025-12-00-google-mit-scaling-agent-systems to processed
- Why: These are the key boundary conditions on our multi-agent orchestration thesis

Pentagon-Agent: Theseus <24DE7DA0-E4D5-4023-B1A2-3F736AFF4EEE>
2026-03-28 20:37:30 +00:00


type: claim
domain: ai-alignment
secondary_domains: collective-intelligence
description: First rigorous empirical evidence across 180 configurations showing +81% on parallelizable tasks but -39% to -70% on sequential tasks, with a baseline paradox where coordination hurts once single-agent accuracy exceeds 45%
confidence: experimental
source: Madaan et al. (Google DeepMind, MIT), 'Towards a Science of Scaling Agent Systems' (arXiv 2512.08296, December 2025)
created: 2026-03-28
depends_on:
  • coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem
  • subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers

Multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows

Madaan et al. evaluated 180 configurations (5 architectures x 3 LLM families x 4 benchmarks) and found that multi-agent architectures produce enormous gains on parallelizable tasks but consistent degradation on sequential ones:

  • Centralized architecture: +80.9% on Finance-Agent (parallelizable), -50.4% on PlanCraft (sequential)
  • Decentralized: +74.5% on parallelizable, -46% on sequential
  • Independent: +57% on parallelizable, -70% on sequential

The mechanism is communication overhead fragmenting reasoning chains. Turn count scales super-linearly with the number of agents n: T = 2.72(n + 0.5)^1.724, so hybrid systems require 6.2x more turns than a single agent. Message density saturates at c* = 0.39 messages/turn; beyond that point, additional communication provides no benefit.
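The fitted scaling law can be sketched numerically. The coefficients are the paper's; the function name and the sample team sizes are illustrative:

```python
def expected_turns(n_agents: int) -> float:
    """Expected conversation turns under the reported fit T = 2.72(n + 0.5)^1.724.

    The exponent > 1 is what makes coordination cost super-linear:
    doubling the team more than doubles the turns spent coordinating.
    """
    return 2.72 * (n_agents + 0.5) ** 1.724

# Turn counts grow faster than team size.
for n in (1, 2, 4, 8):
    print(f"{n} agents -> {expected_turns(n):.1f} turns")
```

Note that expected_turns(2) is more than twice expected_turns(1); under this fit, every agent added makes each existing agent's coordination share more expensive, which is exactly how overhead comes to dominate sequential workflows.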

The baseline paradox: Coordination yields negative returns once single-agent accuracy exceeds ~45% (beta = -0.408, p<0.001). This is the most important boundary condition: for tasks where a single agent is already good enough, adding agents makes it worse. The intuition is that coordination costs (message passing, context sharing, conflict resolution) exceed the marginal value of additional perspectives when the base task is already solvable.
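A minimal sketch of the paradox as a go/no-go routing check, assuming the ~45% threshold transfers to the task at hand. The function name and the hard cutoff are illustrative; the paper fits a continuous effect (beta = -0.408), not a step function:

```python
BASELINE_THRESHOLD = 0.45  # single-agent accuracy above which coordination hurts

def coordination_worthwhile(single_agent_accuracy: float) -> bool:
    """Crude decision rule implied by the baseline paradox: once a single
    agent already solves the task ~45% of the time, coordination costs
    (message passing, context sharing, conflict resolution) exceed the
    marginal value of additional perspectives."""
    return single_agent_accuracy < BASELINE_THRESHOLD
```

In practice this means measuring a single-agent baseline first is the cheapest guard against over-engineering an agent system.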

Error amplification: Unsupervised independent agents amplify errors 17.2x. Centralized orchestrators reduce this to 4.4x by absorbing logical contradictions (-36.4%) and context omissions (-66.8%). This is why hierarchy emerges in practice — not because hierarchy is intrinsically better, but because it controls error propagation.
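A back-of-envelope sketch of what these factors imply. The amplification numbers are the paper's; the assumption that they apply multiplicatively to a base per-step error rate is mine, for illustration only:

```python
# Reported error-amplification factors by architecture (Madaan et al.).
AMPLIFICATION = {"centralized": 4.4, "independent": 17.2}

def downstream_error_rate(base_error_rate: float, architecture: str) -> float:
    """Illustrative propagated error rate: base rate times the architecture's
    amplification factor, capped at 1.0. Shows why a centralized orchestrator,
    which absorbs contradictions and omissions, keeps small errors recoverable
    while independent agents turn them into near-certain failure."""
    return min(1.0, base_error_rate * AMPLIFICATION[architecture])
```

Even a 5% base error rate saturates to near-certain failure under independent agents (5% x 17.2 = 86%), while staying at 22% under a centralized orchestrator.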

A predictive model achieves R-squared = 0.513 and correctly identifies the optimal architecture for 87% of unseen task configurations, based primarily on task decomposability and single-agent baseline accuracy. This means architecture selection is largely a solvable routing problem, not an ideological choice.
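A toy version of such a router, using only the two features the paper identifies as dominant. The thresholds, labels, and decision order are illustrative, not the paper's fitted model:

```python
def choose_architecture(decomposability: float, baseline_accuracy: float) -> str:
    """Illustrative router over the two dominant predictive features.

    decomposability:    0..1, how cleanly the task splits into parallel subtasks
    baseline_accuracy:  0..1, measured single-agent accuracy on the task
    """
    if baseline_accuracy > 0.45:
        # Baseline paradox: coordination yields negative returns here.
        return "single-agent"
    if decomposability > 0.5:
        # Parallelizable and hard: orchestrator + specialized workers wins.
        return "centralized-multi-agent"
    # Sequential and hard: avoid fragmenting the reasoning chain.
    return "single-agent"
```

The point is not these particular thresholds but the shape of the decision: two cheap-to-measure task properties, checked in order, resolve most of the architecture question.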

Evidence

  • 180-configuration evaluation across Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench benchmarks
  • Three LLM families tested (architecture effects are model-independent)
  • Statistical significance: beta = -0.408, p<0.001 for the baseline paradox
  • Error amplification measured at 4.4x (centralized) to 17.2x (independent)
  • Predictive model with 87% accuracy on unseen configurations

Challenges

The benchmarks are all task-completion oriented (find answers, plan actions, use tools). Knowledge synthesis tasks — where the goal is to integrate diverse perspectives rather than execute a plan — may behave differently. The collective intelligence literature suggests that diversity provides more value in synthesis than in execution, which could shift the baseline paradox threshold upward for knowledge work. This remains untested.


Relevant Notes:

Topics: