From b2a772aa3b9fb1f27d10ecb06d9f0331879f0629 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Tue, 24 Mar 2026 19:15:45 +0000
Subject: [PATCH] extract: 2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
---
 ...nderwrite responsibility remains finite.md |  6 ++++
 ...ernance-built-on-unreliable-foundations.md |  6 ++++
 ...ity limits determines real-world impact.md |  6 ++++
 ...-vs-holistic-evaluation-developer-rct.json | 33 +++++++++++++++++++
 ...ic-vs-holistic-evaluation-developer-rct.md | 18 +++++++++-
 5 files changed, 68 insertions(+), 1 deletion(-)
 create mode 100644 inbox/queue/.extraction-debug/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.json

diff --git a/domains/ai-alignment/human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite.md b/domains/ai-alignment/human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite.md
index c3da46d7b..b51dc27b0 100644
--- a/domains/ai-alignment/human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite.md
+++ b/domains/ai-alignment/human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite.md
@@ -32,6 +32,12 @@ Catalini et al. provide the full economic framework for why verification bandwid
 
 ---
 
+### Additional Evidence (confirm)
+*Source: [[2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct]] | Added: 2026-03-24*
+
+The 42-minute average fix time per agent PR (more than half the 1.3-hour average original human task time) quantifies the verification burden. Even when AI completes tasks algorithmically, human verification and correction remain necessary and substantial, supporting the claim that verification bandwidth is the binding constraint.
+
+
 Relevant Notes:
 - [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — Catalini provides the economic mechanism for why oversight degrades
 - [[coordination failures arise from individually rational strategies that produce collectively irrational outcomes because the Nash equilibrium of non-cooperation dominates when trust and enforcement are absent]] — the Codifier's Curse is a coordination failure

diff --git a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
index 7d9864db5..aed4c212e 100644
--- a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
+++ b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
@@ -119,6 +119,12 @@ Anthropic's explicit admission that 'the science of model evaluation isn't well-
 METR's scaffold sensitivity finding (GPT-4o and o3 performing better under Vivaria than Inspect) adds a new dimension to evaluation unreliability: the same model produces different capability estimates depending on evaluation infrastructure, introducing cross-model comparison uncertainty that governance frameworks do not account for.
 
+### Additional Evidence (confirm)
+*Source: [[2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct]] | Added: 2026-03-24*
+
+METR's finding that algorithmic scoring (the standard pre-deployment evaluation method) showed a 38% success rate while holistic evaluation found 0% of passing submissions production-ready demonstrates that pre-deployment evaluations systematically overestimate real-world capability. METR explicitly acknowledges that this creates problems for its own time horizon capability estimates.
+
+

diff --git a/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md b/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md
index 712d767a3..b5da601cd 100644
--- a/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md
+++ b/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md
@@ -40,6 +40,12 @@ The International AI Safety Report 2026 (multi-government committee, February 20
 METR's time horizon metric measures task difficulty by human completion time, not model processing time. A model with a 5-hour time horizon completes tasks that take humans 5 hours, but may finish them in minutes. This speed asymmetry is not captured in the metric itself, meaning the gap between theoretical capability (task completion) and deployment impact includes both adoption lag AND the unmeasured throughput advantage that organizations fail to utilize.
 
+### Additional Evidence (confirm)
+*Source: [[2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct]] | Added: 2026-03-24*
+
+The 19% productivity slowdown among experienced developers using AI tools is direct empirical evidence that adoption does not translate to productivity gains even when tools are used. The RCT design eliminates selection effects, and the experienced-developer population controls for skill level.
+
+
 Relevant Notes:
 - [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]] — capability exists but deployment is uneven

diff --git a/inbox/queue/.extraction-debug/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.json b/inbox/queue/.extraction-debug/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.json
new file mode 100644
index 000000000..35116cc3e
--- /dev/null
+++ b/inbox/queue/.extraction-debug/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.json
@@ -0,0 +1,33 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "benchmark-based-capability-metrics-overstate-real-world-autonomous-performance-by-systematically-missing-production-requirements.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-despite-predicted-speedups.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 3,
+    "rejected": 2,
+    "fixes_applied": [
+      "benchmark-based-capability-metrics-overstate-real-world-autonomous-performance-by-systematically-missing-production-requirements.md:set_created:2026-03-24",
+      "benchmark-based-capability-metrics-overstate-real-world-autonomous-performance-by-systematically-missing-production-requirements.md:stripped_wiki_link:verification degrades faster than capability grows",
+      "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-despite-predicted-speedups.md:set_created:2026-03-24"
+    ],
+    "rejections": [
+      "benchmark-based-capability-metrics-overstate-real-world-autonomous-performance-by-systematically-missing-production-requirements.md:missing_attribution_extractor",
+      "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-despite-predicted-speedups.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-24"
+}
\ No newline at end of file

diff --git a/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md b/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
index 1812b87e1..308a61e4a 100644
--- a/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
+++ b/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
@@ -7,9 +7,13 @@ date: 2025-08-12
 domain: ai-alignment
 secondary_domains: []
 format: research-report
-status: unprocessed
+status: enrichment
 priority: high
 tags: [metr, developer-productivity, benchmark-inflation, capability-measurement, rct, holistic-evaluation, algorithmic-scoring, real-world-performance]
+processed_by: theseus
+processed_date: 2026-03-24
+enrichments_applied: ["the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -68,3 +72,15 @@ PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
 WHY ARCHIVED: This is the strongest empirical evidence found in 13 sessions that benchmark-based capability metrics systematically overstate real-world autonomous capability. The RCT design (not observational), precise quantification (0% production-ready, 19% slowdown), and the source (METR — the primary capability evaluator) make this a high-quality disconfirmation signal for B1 urgency.
 
 EXTRACTION HINT: The extractor should develop the "benchmark-reality gap" as a potential new claim or divergence against existing time-horizon-based capability claims. The key question is whether this gap is stable, growing, or shrinking over model generations — if frontier models also show the gap, this updates the urgency of the entire six-layer governance arc.
+
+
+## Key Facts
+- Claude 3.7 Sonnet achieved a 38% success rate on automated test scoring in METR's evaluation
+- 0% of Claude 3.7 Sonnet's passing submissions were production-ready after human expert review
+- 100% of passing-test runs had testing coverage deficiencies
+- 75% of passing-test runs had documentation gaps
+- 75% of passing-test runs had linting/formatting problems
+- 25% of passing-test runs had residual functionality gaps
+- Average fix time for agent PRs was 42 minutes
+- Original human task time averaged 1.3 hours
+- Experienced developers took 19% longer on tasks with AI assistance in METR's RCT