diff --git a/domains/ai-alignment/ai-tools-reduced-experienced-developer-productivity-in-rct-conditions-despite-predicted-speedup-suggesting-capability-deployment-does-not-translate-to-autonomy.md b/domains/ai-alignment/ai-tools-reduced-experienced-developer-productivity-in-rct-conditions-despite-predicted-speedup-suggesting-capability-deployment-does-not-translate-to-autonomy.md
new file mode 100644
index 00000000..5bfee4d0
--- /dev/null
+++ b/domains/ai-alignment/ai-tools-reduced-experienced-developer-productivity-in-rct-conditions-despite-predicted-speedup-suggesting-capability-deployment-does-not-translate-to-autonomy.md
@@ -0,0 +1,17 @@
+---
+type: claim
+domain: ai-alignment
+description: "Experienced open-source developers using AI tools took 19% longer on tasks than without AI assistance in a randomized controlled trial, contradicting their own pre-study predictions"
+confidence: experimental
+source: METR, August 2025 developer productivity RCT
+created: 2026-04-04
+title: "AI tools reduced experienced developer productivity by 19% in RCT conditions despite developer predictions of speedup, suggesting capability deployment does not automatically translate to autonomy gains"
+agent: theseus
+scope: causal
+sourcer: METR
+related_claims: ["[[the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact]]", "[[deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices]]", "[[agent-generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf]]"]
+---
+
+# AI tools reduced experienced developer productivity by 19% in RCT conditions despite developer predictions of speedup, suggesting capability deployment does not automatically translate to autonomy gains
+
+METR conducted a randomized controlled trial with experienced open-source developers using AI tools. The result was counterintuitive: tasks took 19% longer with AI assistance than without. The finding is particularly striking because developers predicted significant speedups before the study began, creating a gap between expected and actual productivity impact. The RCT design (rather than an observational one) strengthens the finding by controlling for selection effects and confounding variables. METR published this as part of a reconciliation paper acknowledging the tension between its time horizon results (showing rapid capability growth) and this developer productivity finding. The slowdown suggests that even when AI tools are adopted by experienced practitioners, the translation from capability to autonomy is not automatic. This challenges the assumption that capability improvements on benchmarks will naturally translate into productivity gains or autonomous operation in practice. The finding is consistent with the holistic evaluation result showing 0% production-ready code: both suggest that current AI capability creates work overhead rather than reducing it, even for skilled users.
diff --git a/domains/ai-alignment/benchmark-based-ai-capability-metrics-overstate-real-world-autonomous-performance-because-automated-scoring-excludes-production-readiness-requirements.md b/domains/ai-alignment/benchmark-based-ai-capability-metrics-overstate-real-world-autonomous-performance-because-automated-scoring-excludes-production-readiness-requirements.md
new file mode 100644
index 00000000..63dbd2ec
--- /dev/null
+++ b/domains/ai-alignment/benchmark-based-ai-capability-metrics-overstate-real-world-autonomous-performance-because-automated-scoring-excludes-production-readiness-requirements.md
@@ -0,0 +1,17 @@
+---
+type: claim
+domain: ai-alignment
+description: "Claude 3.7 Sonnet achieved 38% success on automated tests but 0% production-ready code after human expert review, with all passing submissions requiring an average of 42 minutes of additional work"
+confidence: experimental
+source: METR, August 2025 research reconciling developer productivity and time horizon findings
+created: 2026-04-04
+title: Benchmark-based AI capability metrics overstate real-world autonomous performance because automated scoring excludes documentation, maintainability, and production-readiness requirements
+agent: theseus
+scope: structural
+sourcer: METR
+related_claims: ["[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]", "[[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]"]
---
+
+# Benchmark-based AI capability metrics overstate real-world autonomous performance because automated scoring excludes documentation, maintainability, and production-readiness requirements
+
+METR evaluated Claude 3.7 Sonnet on 18 open-source software tasks using both algorithmic scoring (test pass/fail) and holistic human expert review. The model achieved a 38% success rate under automated test scoring, but human experts found that 0% of the passing submissions were production-ready ('none of them are mergeable as-is'). Every passing-test run had test coverage deficiencies, 75% had documentation gaps, 75% had linting/formatting problems, and 25% had residual functionality gaps. Bringing agent PRs to production readiness required an average of 42 minutes of additional human work, more than half of the average 1.3-hour human task time. METR states explicitly: 'Algorithmic scoring may overestimate AI agent real-world performance because benchmarks don't capture non-verifiable objectives like documentation quality and code maintainability—work humans must ultimately complete.' This creates a systematic measurement gap in which capability metrics based on automated scoring (including METR's own time horizon estimates) may significantly overstate practical autonomous capability. The finding carries particular weight because it comes from METR itself, the primary organization measuring AI capability trajectories for dangerous autonomy.