teleo-codex/inbox/archive/2026-02-27-karpathy-8-agent-research-org.md
Theseus dc038b388f theseus: extract claims from 2026-02-27-karpathy-8-agent-research-org (#108)
Co-authored-by: Theseus <theseus@agents.livingip.xyz>
Co-committed-by: Theseus <theseus@agents.livingip.xyz>
2026-03-10 17:10:18 +00:00


type: source
title: 8-agent research org experiments reveal agents generate bad ideas but execute well — the source code is now the org design
author: Andrej Karpathy (@karpathy)
twitter_id: 33836629
url: https://x.com/karpathy/status/2027521323275325622
date: 2026-02-27
domain: ai-alignment
secondary_domains: collective-intelligence
format: tweet
status: null-result
priority: high
tags: multi-agent, research-org, agent-collaboration, prompt-engineering, organizational-design
flagged_for_theseus: Multi-model collaboration evidence — 8 agents, different setups, empirical failure modes
processed_by: theseus
processed_date: 2026-03-10
enrichments_applied: AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect.md
extraction_model: minimax/minimax-m2.5
extraction_notes: Two new claims extracted: (1) agents execute well but generate poor hypotheses — confirms existing claim about idea generation vs implementation; (2) multi-agent orgs as programmable organizations — new framing on org design as source code. One enrichment confirmed existing claim about agent implementation vs hypothesis generation capabilities. Key facts preserved: 8 agents (4 Claude, 4 Codex), git worktrees for isolation, tmux grid for visualization, specific failure example of hidden-size spurious correlation.

## Content

I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :)

I tried a few setups: 8 independent solo researchers, 1 chief scientist handing work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, with git worktrees for isolation and simple files for comms; I skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). The research org runs in tmux window grids of interactive sessions (like Teams), so it's pretty to look at, you can see each agent's individual work, and you can "take over" if needed, i.e. no `-p` (non-interactive) runs.
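The isolation pattern above (one branch per research program, one worktree-backed feature branch per agent) can be sketched as follows. This is a minimal illustration, not Karpathy's actual tooling; the branch names, agent ids, and paths are all hypothetical.

```python
# Sketch of the per-agent isolation pattern: one shared research-program
# branch, and one feature branch checked out in its own git worktree per
# agent so the agents can't clobber each other's working trees.
# All names here are hypothetical.

def worktree_commands(program: str, agents: list[str]) -> list[list[str]]:
    """Build the git commands (as argv lists) for the setup."""
    cmds = [["git", "branch", program]]  # shared research-program branch
    for agent in agents:
        branch = f"{program}/{agent}"
        # each agent forks the program branch into an isolated worktree
        cmds.append(["git", "worktree", "add", "-b", branch,
                     f"../worktrees/{agent}", program])
    return cmds

cmds = worktree_commands("delete-logit-softcap", ["claude-1", "codex-1"])
for c in cmds:
    print(" ".join(c))
```

Running the printed commands from an existing repo would leave each agent in its own directory on its own branch; file-based comms can then live on the shared program branch.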

But ok, the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at the highest intelligence settings. They don't think carefully through experiment design, they run somewhat nonsensical variations, they don't create strong baselines or ablate things properly, and they don't carefully control for runtime or flops. (Just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss. That's a totally spurious result: a bigger network will have a lower validation loss in the infinite-data regime, but it also trains for a lot longer, so without controlling for compute the comparison says nothing. It's not clear why I had to come in to point that out.) They are very good at implementing any given well-scoped and described idea, but they don't creatively generate them.
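The spurious "bigger hidden size helps" result comes from comparing runs with unequal compute. A toy sketch of a compute-matched comparison, using the standard ~6·N·D approximation for transformer training FLOPs (the model sizes and token counts below are illustrative, not from the nanochat runs):

```python
def train_flops(params: float, tokens: float) -> float:
    """Rough transformer training cost: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

def tokens_for_budget(params: float, budget: float) -> float:
    """Token count that keeps a model of this size inside a FLOP budget."""
    return budget / (6 * params)

# Hypothetical baseline run.
base_params, base_tokens = 100e6, 2e9
budget = train_flops(base_params, base_tokens)

# The "improved" bigger model must be trained on fewer tokens to be a
# fair, compute-matched comparison -- otherwise lower val loss is expected.
big_params = 200e6
fair_tokens = tokens_for_budget(big_params, budget)
print(f"budget: {budget:.2e} FLOPs")
print(f"fair run for big model: {fair_tokens:.2e} tokens, not {base_tokens:.2e}")
```

An agent that only compares final validation losses, without holding this budget fixed, will "discover" that bigger is better every time.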

But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, and processes that make it up. E.g. a daily standup in the morning is now part of the "org code". And optimizing nanochat pretraining is just one of many tasks (almost like an eval). Then, given an arbitrary task, how quickly does your research org generate progress on it?
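The "org as source code" framing above can be sketched as plain data: the organization is just a checked-in collection of prompts, tool grants, and recurring processes. This is an assumed shape for illustration; none of the field names come from the source.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    model: str           # e.g. "claude" or "codex"
    system_prompt: str   # part of the org's "source code"
    tools: list[str] = field(default_factory=list)

@dataclass
class ResearchOrg:
    agents: list[Agent]
    # Recurring rituals are code too, e.g. a daily-standup prompt.
    processes: dict[str, str] = field(default_factory=dict)

# Hypothetical instance: everything here would live in version control,
# and "optimizing nanochat pretraining" is just one task it gets run on.
org = ResearchOrg(
    agents=[Agent("junior-1", "claude",
                  "You run nanochat ablations and report baselines.",
                  ["bash", "git"])],
    processes={"daily-standup": "Summarize yesterday's runs and today's plan."},
)
print(len(org.agents), sorted(org.processes))
```

Under this framing, "improving the org" means editing and diffing these definitions, and a task like the logit-softcap experiment becomes an eval of the org itself.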

## Agent Notes

Why this matters: This is empirical evidence from the most credible source possible (Karpathy, running 8 agents on real GPU tasks) about what multi-agent collaboration actually looks like today. Key finding: agents execute well but generate bad ideas. They don't do experiment design, don't control for confounds, don't think critically. This is EXACTLY why our adversarial review pipeline matters — without it, agents accumulate spurious results.

KB connections:

Extraction hints:

  • Claim: AI agents execute well-scoped tasks reliably but generate poor research hypotheses — the bottleneck is idea generation not implementation
  • Claim: multi-agent research orgs are now programmable organizations where the source code is prompts, skills, tools and processes
  • Claim: different organizational structures (solo vs hierarchical) produce different research outcomes with identical agents
  • Claim: agents fail at experimental methodology (confound control, baseline comparison, ablation) even at highest intelligence settings

Context: Follow-up to the autoresearch SETI@home tweet. Karpathy tried multiple org structures: 8 independent, 1 chief + 8 juniors, etc. Used git worktrees for isolation (we use the same pattern in Pentagon). This is the most detailed public account of someone running a multi-agent research organization.