teleo-codex/inbox/queue/2026-03-25-metr-developer-productivity-rct-full-paper.md

---
type: source
title: "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity — Full RCT"
author: METR Research Team (Becker, Rush et al.)
url: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
date: 2025-07-10
domain: ai-alignment
secondary_domains:
  - developer-productivity
format: research-paper
status: enrichment
priority: medium
tags:
  - RCT
  - benchmark-reality-gap
  - METR
  - AI-tools
  - slowdown
  - human-AI-collaboration
processed_by: theseus
processed_date: 2026-03-25
enrichments_applied:
  - pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
extraction_model: anthropic/claude-sonnet-4.5
---

Content

METR's randomized controlled trial measuring how early-2025 AI tools affect productivity of experienced open-source developers working on their own repositories.

Study design:

  • 16 experienced developers from large open-source repos (averaging 22k+ stars, 1M+ lines of code)
  • 246 completed issues (randomly assigned to allow/disallow AI use)
  • Primary AI tools: Cursor Pro with Claude 3.5/3.7 Sonnet
  • 143 hours of screen recordings analyzed at ~10-second resolution (29% of total hours)
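The issue-level randomization in the design above can be sketched as a per-issue coin flip. Everything here (the function name, arm labels, and seeding) is an illustrative assumption, not METR's actual protocol, which the full paper specifies.

```python
import random

def assign_issues(issues, seed=0):
    """Randomly assign each issue to the AI-allowed or AI-disallowed arm.

    Hypothetical sketch: a fair per-issue coin flip. METR's actual
    randomization procedure is described in the full paper.
    """
    rng = random.Random(seed)
    return {issue: rng.choice(["ai_allowed", "ai_disallowed"]) for issue in issues}

# Example: randomize a batch of issues for one developer
issues = [f"issue-{i}" for i in range(10)]
assignment = assign_issues(issues)
```

Randomizing at the issue level (rather than the developer level) is what lets each developer serve as their own control across the 246 issues.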

Core finding: AI tools caused developers to take 19% longer to complete issues.

The perception gap: Before tasks, developers forecast AI would reduce time by 24%. After completing the study, developers estimated AI had reduced time by 20%. Actual result: 19% slower. Developers systematically misperceive AI assistance as a productivity gain even when experiencing a slowdown.
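Expressing the three reported figures as multipliers on task completion time makes the size of the gap concrete. The roughly 33% underestimate derived below is my arithmetic on the reported numbers, not a figure from the paper.

```python
# Study figures expressed as multipliers on task completion time
forecast = 1 - 0.24    # before: developers predicted 24% less time -> 0.76x
perceived = 1 - 0.20   # after: developers believed 20% less time   -> 0.80x
actual = 1 + 0.19      # measured: tasks took 19% longer            -> 1.19x

# Ratio of believed time to measured time: how far off the self-assessment was
ratio = perceived / actual    # ~0.67
misperception = 1 - ratio     # developers underestimated elapsed time by ~33%
print(f"perceived/actual: {ratio:.2f}, underestimate: {misperception:.0%}")
```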

Why developers were slower: METR identifies contributing factors from a factor analysis but defers the full behavioral explanation to the complete paper. The screen-recording analysis supports behavioral decomposition at ~10-second resolution.

Statistical significance: 246 issues provided "just enough statistical power to reject the null hypothesis." Confidence intervals use clustered standard errors. The effect is statistically significant, but the study sits at the edge of its statistical power.
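The blog does not give the exact regression specification, but the clustered-standard-error approach it names can be sketched generically: OLS of log completion time on a treatment dummy, with CR1 (Liang-Zeger) standard errors clustered by developer. Everything below, including the simulated data, is an assumed illustration, not METR's analysis code.

```python
import numpy as np

def ols_cluster_se(X, y, clusters):
    """OLS point estimates with CR1 cluster-robust (Liang-Zeger) standard errors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((k, k))
    labels = np.unique(clusters)
    for g in labels:                      # sum score outer products per cluster
        Xg, ug = X[clusters == g], resid[clusters == g]
        s = Xg.T @ ug
        meat += np.outer(s, s)
    G = len(labels)
    correction = (G / (G - 1)) * ((n - 1) / (n - k))   # CR1 small-sample factor
    cov = correction * bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))

# Simulated stand-in for the study's structure: 246 issues across 16 developers,
# with a true effect of log(1.19) (a 19% slowdown) on log completion time.
rng = np.random.default_rng(0)
n, G = 246, 16
dev = rng.integers(0, G, size=n)                 # which developer owns each issue
ai = rng.integers(0, 2, size=n).astype(float)    # randomized AI-allowed dummy
log_time = 1.0 + np.log(1.19) * ai + rng.normal(0, 0.3, G)[dev] + rng.normal(0, 0.5, n)
X = np.column_stack([np.ones(n), ai])
beta, se = ols_cluster_se(X, log_time, dev)      # beta[1] near log(1.19)
```

Clustering on developer matters because issues owned by the same developer share unobserved traits; with only 16 clusters, the effective sample is far smaller than 246, which is why the study sits at the edge of its power.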

Generalizability limitation: Authors explicitly state they "do not provide evidence that AI systems do not speed up individuals or groups in domains other than software development." This finding is specific to: experienced developers, their own long-standing repositories, early-2025 AI tools (Cursor Pro + Claude 3.5/3.7 Sonnet), and real issues they'd normally work on.

arXiv paper: 2507.09089. GitHub data: METR/Measuring-Early-2025-AI-on-Exp-OSS-Devs.

Agent Notes

Why this matters: This RCT is the most rigorous empirical study of AI productivity impact on experienced practitioners (the related "0% production-ready" finding comes from METR's separate holistic evaluation, not this RCT). The 19% slowdown combined with the perception gap (developers thought they were faster) is the most striking finding: AI creates an illusion of productivity while decreasing actual productivity for experienced practitioners in their own domain.

What surprised me: The screen recording methodology (143 hours at 10-second resolution) is unusually rigorous for productivity research. METR was able to decompose exactly what developers were doing differently with vs. without AI. The behavioral mechanism behind the slowdown is documented but not in the blog summary.

What I expected but didn't find: Task-type breakdown (bug fix vs. feature vs. refactor). The blog doesn't segment by task type. If the slowdown is concentrated in certain task types, that would substantially qualify the finding.

KB connections:

Extraction hints:

  1. The perception gap ("thought AI helped, actually slower") is potentially a new KB claim about AI productivity illusion
  2. The methodology (RCT + screen recording) is the strongest design deployed for AI productivity research; worth noting in any claim about AI productivity evidence quality
  3. Note: The "0% production-ready" finding is from the holistic evaluation research (metr.org/blog/2025-08-12...), not from this RCT directly. This RCT found developers submitted "similar quality PRs" — the quality failure is for autonomous AI agents, not human+AI collaboration. These are two separate findings that should not be conflated.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: the gap between theoretical AI capability and observed deployment is massive across all occupations — provides the strongest empirical evidence that expert productivity with AI tools may decline, not just lag

WHY ARCHIVED: Foundation for the benchmark-reality gap analysis; also contains the strongest RCT evidence on human-AI productivity in expert domains

EXTRACTION HINT: CRITICAL DISTINCTION: This RCT measures human developers using AI tools → they were slower. The "0% production-ready" finding is from METR's separate holistic evaluation of autonomous AI agents. Do NOT conflate. The RCT is primarily about human+AI productivity; the holistic evaluation is about AI-only task completion. Both matter but for different KB claims.

Key Facts

  • METR's developer productivity RCT included 16 experienced developers from repos averaging 22k+ stars and 1M+ lines of code
  • The study analyzed 246 completed issues with 143 hours of screen recordings at ~10-second resolution (29% of total hours)
  • Primary AI tools tested were Cursor Pro with Claude 3.5/3.7 Sonnet
  • Developers forecast AI would reduce time by 24% before tasks, estimated 20% reduction after study, but actual result was 19% slower
  • The study used clustered standard errors and was at the edge of statistical power with 246 issues
  • Full paper published as arXiv 2507.09089 with GitHub data at METR/Measuring-Early-2025-AI-on-Exp-OSS-Devs