---
type: source
title: "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity"
author: "METR (@METR_Evals)"
url: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
date: 2025-07-10
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: null-result
priority: high
triage_tag: claim
tags: [developer-productivity, rct, ai-tools, over-reliance, perception-gap, automation-overshoot]
processed_by: theseus
processed_date: 2026-03-18
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "LLM returned 2 claims, 2 rejected by validator"
---
## Content
Randomized controlled trial: 16 experienced open-source developers completing 246 tasks on mature codebases (averaging 22k+ GitHub stars, 1M+ lines of code, and 5+ years of the developer's personal experience with the codebase). Developers primarily used Cursor Pro with Claude 3.5/3.7 Sonnet.
**Main finding:** Developers were 19% SLOWER with AI tools than without.
**The perception gap:** Before tasks, developers predicted 24% speedup. After completion, they estimated 20% speedup. The ACTUAL result was 19% slowdown. They believed AI helped when it measurably hurt.
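A minimal sketch of what these percentages mean in terms of task completion time, assuming "19% slower" is the percentage increase in time to complete a task when AI is allowed. The study's actual analysis is more involved, and the function name and timings below are hypothetical, illustrative only:
```python
# Illustrative only: hypothetical timings, not METR's data or analysis.
def pct_change(t_without_ai: float, t_with_ai: float) -> float:
    """Percentage change in completion time when AI is allowed.
    Positive means slower with AI; negative means faster."""
    return (t_with_ai - t_without_ai) / t_without_ai * 100

# A hypothetical task taking 60 min without AI and 71.4 min with AI
# reproduces the headline figure:
print(f"{pct_change(60.0, 71.4):+.1f}% change in time")  # +19.0%
```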
**Mechanisms identified:**
1. Context switching and decision fatigue
2. Over-reliance on suggestions requiring correction
3. Tool complexity and learning curve friction
4. Integration challenges with existing workflows
5. Time on non-coding elements (documentation, testing, style)
**Acceptance rate:** Developers accepted fewer than 44% of AI suggestions, pointing to widespread quality issues.
**Nuances:**
- Developers had ~50 hours of experience with the tool (results may improve with more practice)
- Results may differ for less experienced developers or unfamiliar codebases
- The study authors emphasize results are context-specific to expert developers in familiar, complex codebases
**The DX newsletter analysis adds:** "Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied." It reads the perception gap as showing developers "influenced by industry hype or their perception of the potential of AI."
## Agent Notes
**Triage:** [CLAIM] — "experienced developers are measurably slower with AI coding tools while believing they are faster, revealing a systematic perception gap between perceived and actual AI productivity" — RCT evidence, strongest study design
**Why this matters:** The PERCEPTION GAP is the critical finding for the overshoot thesis. If practitioners systematically overestimate AI's benefit, economic decision-makers using practitioner feedback will systematically over-adopt. The gap between perceived and actual value is the mechanism by which firms overshoot the optimal automation level.
**What surprised me:** The magnitude of the perception gap. Not just wrong — wrong in the opposite direction. 20% faster (perceived) vs 19% slower (actual) = 39 percentage point gap. This isn't miscalibration; it's systematic delusion.
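Making the gap arithmetic explicit, using the figures reported above (the percentage-point framing follows this note's reading, and the variable names are ours):
```python
# Perception gap in percentage points, from the study's reported figures.
predicted_speedup = 24    # % speedup developers forecast before the tasks
perceived_speedup = 20    # % speedup developers estimated after completion
actual_speedup = -19      # % (negative: measured 19% slowdown)

post_hoc_gap = perceived_speedup - actual_speedup   # 20 - (-19) = 39 pp
pre_task_gap = predicted_speedup - actual_speedup   # 24 - (-19) = 43 pp
print(post_hoc_gap, pre_task_gap)  # 39 43
```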
**KB connections:** [[AI capability and reliability are independent dimensions]], [[deep technical expertise is a greater force multiplier when combined with AI agents]] — this CHALLENGES the expertise-as-multiplier claim for deeply familiar codebases, [[agent-generated code creates cognitive debt]]
**Extraction hints:** Two distinct claims: (1) the productivity result and (2) the perception gap. The perception gap may be a more important claim than the productivity result because it explains HOW overshoot occurs.
## Curator Notes
PRIMARY CONNECTION: deep technical expertise is a greater force multiplier when combined with AI agents
WHY ARCHIVED: RCT evidence that challenges the expertise-multiplier claim for expert-on-familiar-codebase context. The 39-point perception gap is a novel finding that explains HOW automation overshoot occurs — practitioners' self-reports systematically mislead adoption decisions.
## Key Facts
- METR conducted an RCT with 16 experienced open-source developers on 246 tasks
- Codebases averaged 22k+ GitHub stars, 1M+ lines of code, 5+ years developer experience
- Primary tool was Cursor Pro with Claude 3.5/3.7 Sonnet
- Developers had ~50 hours of AI coding tool experience
- Measured productivity: 19% slower with AI tools
- Predicted productivity (before): 24% faster
- Estimated productivity (after): 20% faster
- AI suggestion acceptance rate: fewer than 44%
- Study published 2025-07-10 by METR (@METR_Evals)