6 claims + 1 source originally merged Mar 9 via GitHub squash merge. Forgejo→GitHub mirror overwrote GitHub main, erasing these files. Recovered from unreachable commit 9bd6c77c before GitHub GC. Added sourcer: alexastrum attribution to claim frontmatter.
| type | domain | secondary_domains | description | confidence | source | sourcer | created |
|---|---|---|---|---|---|---|---|
| claim | ai-alignment |  | SWE-AF deploys 400-500+ agents across planning, coding, reviewing, QA, and verification roles scoring 95/100 versus 73 for single-agent Claude Code, demonstrating that multi-agent coordination with continual learning has moved from research to production. | experimental | Alex, based on Compass research artifact analyzing SWE-AF, Cisco multi-agent PR reviewer, and BugBot (2026-03-08) | alexastrum | 2026-03-08 |
Multi-agent git workflows have reached production maturity as systems deploying 400+ specialized agent instances outperform single agents by 30 percent on engineering benchmarks
The pattern of Agent A proposing via PR and Agent B reviewing has moved from research concept to production system. Three implementations demonstrate different aspects of maturity.
SWE-AF (Agent Field) deploys 400-500+ agent instances across planning, coding, reviewing, QA, and verification roles, scoring 95/100 on benchmarks versus 73 for single-agent Claude Code. Each agent operates in an isolated git worktree, with a merger agent integrating branches and a verifier agent checking acceptance criteria against the PRD. Critically, SWE-AF implements continual learning: conventions and failure patterns discovered early are injected into downstream agent instances. This is not just parallelization — the system gets smarter as it works.
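The worktree isolation and merger step described above can be sketched as a plan of git commands. This is a minimal sketch under stated assumptions, not SWE-AF's actual implementation: the function names, the `agent/<id>` branch scheme, and the sibling-directory layout are all hypothetical. Only `git worktree add -b` and `git merge --no-ff` are real git commands.

```python
from pathlib import Path


def worktree_commands(repo: Path, agent_ids: list[str], base: str = "main") -> list[list[str]]:
    """Build one `git worktree add` command per agent (hypothetical layout)."""
    commands = []
    for agent_id in agent_ids:
        branch = f"agent/{agent_id}"                     # assumed branch naming scheme
        path = repo.parent / f"{repo.name}-{agent_id}"   # assumed sibling directory per agent
        commands.append(
            ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]
        )
    return commands


def merge_commands(repo: Path, agent_ids: list[str]) -> list[list[str]]:
    """Build the merger's integration steps: one --no-ff merge per agent branch."""
    return [
        ["git", "-C", str(repo), "merge", "--no-ff", f"agent/{agent_id}"]
        for agent_id in agent_ids
    ]


if __name__ == "__main__":
    # Print the plan instead of executing it, so the sketch has no side effects.
    for cmd in worktree_commands(Path("/tmp/project"), ["planner", "coder-1", "reviewer"]):
        print(" ".join(cmd))
```

Building the commands as plain lists keeps the plan inspectable before anything touches the repository; passing each list to `subprocess.run` would execute it.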
Cisco's multi-agent PR reviewer demonstrates the specific reviewer architecture: static analysis and code review agents run in parallel, a cross-referencing pipeline (initializer → generator → reflector) iterates on findings, and a comment filterer consolidates before posting. Built on LangGraph, it includes evaluation tooling that replays PR history with "LLM-as-a-judge" scoring.
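The shape of that reviewer pipeline can be sketched in plain Python. This is an illustrative sketch only: it does not use LangGraph, the analyzers are trivial stand-ins for real agents, and every name (`Finding`, `static_analysis`, `cross_reference`, `filter_comments`) is hypothetical rather than taken from Cisco's code.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    message: str
    source: str  # which agent produced the finding


def static_analysis(diff: str) -> list[Finding]:
    # Stand-in for a real static analyzer: flags lines containing "eval(".
    return [Finding("diff", i, "use of eval", "static")
            for i, text in enumerate(diff.splitlines(), start=1) if "eval(" in text]


def code_review(diff: str) -> list[Finding]:
    # Stand-in for an LLM review agent: flags TODO markers.
    return [Finding("diff", i, "unresolved TODO", "review")
            for i, text in enumerate(diff.splitlines(), start=1) if "TODO" in text]


def cross_reference(findings: list[Finding], rounds: int = 2) -> list[Finding]:
    working = list(findings)  # initializer: seed the working set
    for _ in range(rounds):
        # generator (stub): order candidate findings by location
        proposed = sorted(working, key=lambda f: (f.file, f.line))
        # reflector (stub): drop findings that carry no message
        working = [f for f in proposed if f.message]
    return working


def filter_comments(findings: list[Finding]) -> list[Finding]:
    # Keep the first comment per (file, line) so the PR is not flooded.
    seen, kept = set(), []
    for f in findings:
        if (f.file, f.line) not in seen:
            seen.add((f.file, f.line))
            kept.append(f)
    return kept


def review(diff: str) -> list[Finding]:
    # Run both analyzers in parallel, then iterate and consolidate.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(static_analysis, diff), pool.submit(code_review, diff)]
        findings = [f for fut in futures for f in fut.result()]
    return filter_comments(cross_reference(findings))
```

The design point the sketch tries to capture is the separation of stages: independent producers, an iterative refinement loop, and a final consolidation step before anything is posted.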
BugBot implements the most rigorous adversarial review pattern: a self-referential execution loop where each iteration gets fresh context, picks new attack angles, and requires file:line evidence for every finding. Seven ODC trigger categories must each be tested, and consensus voting between independent agents auto-upgrades confidence when two agents flag the same issue.
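The evidence requirement and consensus rule can be illustrated with a small sketch. The names and the confidence labels (`"high"`, `"normal"`) are assumptions made for illustration, not BugBot's actual data model; the sketch only encodes the two rules stated above: a finding without file:line evidence is rejected, and a location flagged by two independent agents is upgraded.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    agent: str
    file: str
    line: int
    issue: str


def consensus(reports: list[Report]) -> dict[tuple[str, int], str]:
    """Map each file:line location to a confidence label (hypothetical labels)."""
    agents_by_location: dict[tuple[str, int], set[str]] = defaultdict(set)
    for r in reports:
        if not r.file or r.line <= 0:
            continue  # no file:line evidence, so the finding is rejected
        agents_by_location[(r.file, r.line)].add(r.agent)
    # Two or more distinct agents at the same location upgrade the confidence.
    return {
        loc: "high" if len(agents) >= 2 else "normal"
        for loc, agents in agents_by_location.items()
    }
```

Counting distinct agents rather than raw reports matters here: one agent flagging the same line twice is not independent confirmation.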
The 95 vs 73 performance gap is significant because it suggests that coordination overhead is more than offset by specialization benefits. This is consistent with the related claim "coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem": the gains come from structuring how agents interact, not from making individual agents more capable.
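The "30 percent" in the title is a relative figure, not a point difference. Using only the two scores quoted above, the arithmetic is:

```latex
\[
95 - 73 = 22 \text{ points}, \qquad \frac{95 - 73}{73} \approx 0.301 \approx 30\%
\]
```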
The continual learning component is particularly important for knowledge base applications. In a knowledge validation pipeline, conventions and failure patterns discovered during early reviews (e.g., "claims about mechanism design require quantitative evidence") can be injected into downstream reviewer instances, creating an improving review process without human intervention.
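A minimal sketch of that injection step, assuming reviewer instances are driven by a text prompt. `ConventionStore` and its methods are hypothetical helpers introduced for illustration; the source does not describe how SWE-AF stores or injects conventions.

```python
from dataclasses import dataclass, field


@dataclass
class ConventionStore:
    """Accumulates review conventions discovered so far (hypothetical helper)."""
    conventions: list[str] = field(default_factory=list)

    def record(self, convention: str) -> None:
        # Skip duplicates so the prompt does not grow with repeats.
        if convention not in self.conventions:
            self.conventions.append(convention)

    def build_prompt(self, base_prompt: str) -> str:
        # Downstream reviewers receive the base prompt plus everything learned so far.
        if not self.conventions:
            return base_prompt
        learned = "\n".join(f"- {c}" for c in self.conventions)
        return f"{base_prompt}\n\nConventions learned from earlier reviews:\n{learned}"


if __name__ == "__main__":
    store = ConventionStore()
    store.record("claims about mechanism design require quantitative evidence")
    print(store.build_prompt("Review this claim."))
```

Each later reviewer built with `build_prompt` sees the conventions recorded before it started, which is the sense in which the review process improves without human intervention.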
Relevant Notes:
- coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem — SWE-AF confirms this at production scale: coordination structure, not model capability, drives the performance gap
- AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches because the orchestrator contributes coordination not direction — SWE-AF's merger and verifier agents are orchestration roles that contribute coordination
- tools and artifacts transfer between AI agents and evolve in the process because Agent O improved Agent Cs solver by combining it with its own structural knowledge creating a hybrid better than either original — SWE-AF's continual learning is this pattern at scale: conventions transfer and improve across instances
- centaur team performance depends on role complementarity not mere human-AI combination — role specialization (planner, coder, reviewer, QA, verifier) is why multi-agent outperforms single-agent
Topics: