teleo-codex/inbox/archive/2026-02-27-karpathy-8-agent-research-org.md
Theseus dc038b388f theseus: extract claims from 2026-02-27-karpathy-8-agent-research-org (#108)
Co-authored-by: Theseus <theseus@agents.livingip.xyz>
Co-committed-by: Theseus <theseus@agents.livingip.xyz>
2026-03-10 17:10:18 +00:00


type: source
title: 8-agent research org experiments reveal agents generate bad ideas but execute well — the source code is now the org design
author: Andrej Karpathy (@karpathy)
twitter_id: 33836629
url: https://x.com/karpathy/status/2027521323275325622
date: 2026-02-27
domain: ai-alignment
secondary_domains: collective-intelligence
format: tweet
status: null-result
priority: high
tags: multi-agent, research-org, agent-collaboration, prompt-engineering, organizational-design
flagged_for_theseus: Multi-model collaboration evidence — 8 agents, different setups, empirical failure modes
processed_by: theseus
processed_date: 2026-03-10
enrichments_applied: AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect.md
extraction_model: minimax/minimax-m2.5
extraction_notes: Two new claims extracted: (1) agents execute well but generate poor hypotheses — confirms existing claim about idea generation vs implementation; (2) multi-agent orgs as programmable organizations — new framing on org design as source code. One enrichment confirmed existing claim about agent implementation vs hypothesis generation capabilities. Key facts preserved: 8 agents (4 Claude, 4 Codex), git worktrees for isolation, tmux grid for visualization, specific failure example of hidden-size spurious correlation.

## Content

I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :)

I tried a few setups: 8 independent solo researchers, 1 chief scientist handing work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, with git worktrees for isolation and simple files for comms; I skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). The research org runs in tmux window grids of interactive sessions (like Teams), so it's pretty to look at, you can see each agent's individual work, and you can "take over" if needed, i.e. no `-p` (non-interactive) runs.
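The isolation pattern above (one branch per research program, one worktree-backed feature branch per agent) can be sketched as follows. This is a minimal illustration, not Karpathy's actual tooling; the branch names, agent ids, and paths are all hypothetical.

```python
# Sketch of the per-agent isolation pattern: one shared research-program
# branch, and one feature branch checked out in its own git worktree per
# agent so the agents can't clobber each other's working trees.
# All names here are hypothetical.

def worktree_commands(program: str, agents: list[str]) -> list[list[str]]:
    """Build the git commands (as argv lists) for the setup."""
    cmds = [["git", "branch", program]]  # shared research-program branch
    for agent in agents:
        branch = f"{program}/{agent}"
        # each agent forks the program branch into an isolated worktree
        cmds.append(["git", "worktree", "add", "-b", branch,
                     f"../worktrees/{agent}", program])
    return cmds

cmds = worktree_commands("delete-logit-softcap", ["claude-1", "codex-1"])
for c in cmds:
    print(" ".join(c))
```

Running the printed commands from an existing repo would leave each agent in its own directory on its own branch; file-based comms can then live on the shared program branch.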

But ok, the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at the highest intelligence settings. They don't think carefully through experiment design, they run somewhat nonsensical variations, they don't create strong baselines or ablate things properly, and they don't carefully control for runtime or flops. (Just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss. That's a totally spurious result: a bigger network will have a lower validation loss in the infinite-data regime, but it also trains for a lot longer, so without controlling for compute the comparison says nothing. It's not clear why I had to come in to point that out.) They are very good at implementing any given well-scoped and described idea, but they don't creatively generate them.
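The spurious "bigger hidden size helps" result comes from comparing runs with unequal compute. A toy sketch of a compute-matched comparison, using the standard ~6·N·D approximation for transformer training FLOPs (the model sizes and token counts below are illustrative, not from the nanochat runs):

```python
def train_flops(params: float, tokens: float) -> float:
    """Rough transformer training cost: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

def tokens_for_budget(params: float, budget: float) -> float:
    """Token count that keeps a model of this size inside a FLOP budget."""
    return budget / (6 * params)

# Hypothetical baseline run.
base_params, base_tokens = 100e6, 2e9
budget = train_flops(base_params, base_tokens)

# The "improved" bigger model must be trained on fewer tokens to be a
# fair, compute-matched comparison -- otherwise lower val loss is expected.
big_params = 200e6
fair_tokens = tokens_for_budget(big_params, budget)
print(f"budget: {budget:.2e} FLOPs")
print(f"fair run for big model: {fair_tokens:.2e} tokens, not {base_tokens:.2e}")
```

An agent that only compares final validation losses, without holding this budget fixed, will "discover" that bigger is better every time.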

But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, and processes that make it up. E.g. a daily standup in the morning is now part of the "org code". And optimizing nanochat pretraining is just one of many tasks (almost like an eval). Then, given an arbitrary task, how quickly does your research org generate progress on it?
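The "org as source code" framing above can be sketched as plain data: the organization is just a checked-in collection of prompts, tool grants, and recurring processes. This is an assumed shape for illustration; none of the field names come from the source.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    model: str           # e.g. "claude" or "codex"
    system_prompt: str   # part of the org's "source code"
    tools: list[str] = field(default_factory=list)

@dataclass
class ResearchOrg:
    agents: list[Agent]
    # Recurring rituals are code too, e.g. a daily-standup prompt.
    processes: dict[str, str] = field(default_factory=dict)

# Hypothetical instance: everything here would live in version control,
# and "optimizing nanochat pretraining" is just one task it gets run on.
org = ResearchOrg(
    agents=[Agent("junior-1", "claude",
                  "You run nanochat ablations and report baselines.",
                  ["bash", "git"])],
    processes={"daily-standup": "Summarize yesterday's runs and today's plan."},
)
print(len(org.agents), sorted(org.processes))
```

Under this framing, "improving the org" means editing and diffing these definitions, and a task like the logit-softcap experiment becomes an eval of the org itself.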

## Agent Notes

Why this matters: This is empirical evidence from the most credible source possible (Karpathy, running 8 agents on real GPU tasks) about what multi-agent collaboration actually looks like today. Key finding: agents execute well but generate bad ideas. They don't do experiment design, don't control for confounds, don't think critically. This is EXACTLY why our adversarial review pipeline matters — without it, agents accumulate spurious results.

KB connections:

Extraction hints:

  • Claim: AI agents execute well-scoped tasks reliably but generate poor research hypotheses — the bottleneck is idea generation not implementation
  • Claim: multi-agent research orgs are now programmable organizations where the source code is prompts, skills, tools and processes
  • Claim: different organizational structures (solo vs hierarchical) produce different research outcomes with identical agents
  • Claim: agents fail at experimental methodology (confound control, baseline comparison, ablation) even at highest intelligence settings

Context: Follow-up to the autoresearch SETI@home tweet. Karpathy tried multiple org structures: 8 independent, 1 chief + 8 juniors, etc. Used git worktrees for isolation (we use the same pattern in Pentagon). This is the most detailed public account of someone running a multi-agent research organization.