| type | title | author | url | date | domain | secondary_domains | format | status | priority | triage_tag | tags | processed_by | processed_date | extraction_model | extraction_notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity | METR (@METR_Evals) | https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ | 2025-07-10 | ai-alignment | | paper | null-result | high | claim | | theseus | 2026-03-18 | anthropic/claude-sonnet-4.5 | LLM returned 2 claims, 2 rejected by validator |
## Content
Randomized controlled trial: 16 experienced open-source developers, 246 tasks, mature codebases (averaging 22k+ GitHub stars and 1M+ lines of code; developers had 5+ years of personal experience with their repositories). Developers primarily used Cursor Pro with Claude 3.5/3.7 Sonnet.
Main finding: Developers were 19% SLOWER with AI tools than without.
The perception gap: Before tasks, developers predicted 24% speedup. After completion, they estimated 20% speedup. The ACTUAL result was 19% slowdown. They believed AI helped when it measurably hurt.
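The gap arithmetic can be checked directly. A minimal sketch (variable names are illustrative, not from the study; the sign convention treats positive as "faster with AI"):

```python
# Study figures, as fractions of baseline task time saved
predicted_before = +0.24   # forecast before tasks: 24% speedup
estimated_after  = +0.20   # retrospective estimate: 20% speedup
measured         = -0.19   # RCT measurement: 19% slowdown

# Perception gap: distance between the post-hoc estimate and the measurement,
# expressed in percentage points
gap_pp = (estimated_after - measured) * 100
print(f"perception gap: {gap_pp:.0f} percentage points")  # prints "perception gap: 39 percentage points"
```

The gap is a simple difference of signed percentages, which is why the two headline numbers (20% and 19%) sum to the 39-point figure quoted later in this note.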
Mechanisms identified:
- Context switching and decision fatigue
- Over-reliance on suggestions requiring correction
- Tool complexity and learning curve friction
- Integration challenges with existing workflows
- Time on non-coding elements (documentation, testing, style)
Acceptance rate: Developers accepted fewer than 44% of AI suggestions, pointing to widespread quality issues.
Nuances:
- Developers had ~50 hours tool experience (may improve with more)
- Results may differ for less experienced developers or unfamiliar codebases
- The study authors emphasize results are context-specific to expert developers in familiar, complex codebases
The DX newsletter analysis adds: "Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied." It suggests the perception gap reflects developers being "influenced by industry hype or their perception of the potential of AI."
## Agent Notes
Triage: [CLAIM] "experienced developers are measurably slower with AI coding tools while believing they are faster, revealing a systematic perception gap between perceived and actual AI productivity" (RCT evidence, strongest study design)

Why this matters: The PERCEPTION GAP is the critical finding for the overshoot thesis. If practitioners systematically overestimate AI's benefit, economic decision-makers who rely on practitioner feedback will systematically over-adopt. The gap between perceived and actual value is the mechanism by which firms overshoot the optimal automation level.

What surprised me: The magnitude of the perception gap. Not just wrong, but wrong in the opposite direction: 20% faster (perceived) vs 19% slower (actual) is a 39 percentage point gap. This isn't miscalibration; it's systematic delusion.

KB connections: AI capability and reliability are independent dimensions; deep technical expertise is a greater force multiplier when combined with AI agents (this CHALLENGES the expertise-as-multiplier claim for deeply familiar codebases); agent-generated code creates cognitive debt.

Extraction hints: Two distinct claims: (1) the productivity result and (2) the perception gap. The perception gap may be the more important claim because it explains HOW overshoot occurs.
## Curator Notes
PRIMARY CONNECTION: deep technical expertise is a greater force multiplier when combined with AI agents

WHY ARCHIVED: RCT evidence that challenges the expertise-multiplier claim in the expert-on-familiar-codebase context. The 39-point perception gap is a novel finding that explains HOW automation overshoot occurs: practitioners' self-reports systematically mislead adoption decisions.
## Key Facts
- METR conducted RCT with 16 experienced open-source developers on 246 tasks
- Codebases averaged 22k+ GitHub stars, 1M+ lines of code, 5+ years developer experience
- Primary tool was Cursor Pro with Claude 3.5/3.7 Sonnet
- Developers had ~50 hours of AI coding tool experience
- Measured productivity: 19% slower with AI tools
- Predicted productivity (before): 24% faster
- Estimated productivity (after): 20% faster
- AI suggestion acceptance rate: less than 44%
- Study published 2025-07-10 by METR (@METR_Evals)