From c0773809bdc53f36aec6d1806c08a7f6a5ba9334 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Tue, 24 Mar 2026 00:16:48 +0000
Subject: [PATCH] extract: 2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
---
 ...ernance-built-on-unreliable-foundations.md | 6 ++++
 ...ity limits determines real-world impact.md | 6 ++++
 ...-vs-holistic-evaluation-developer-rct.json | 33 +++++++++++++++++++
 ...ic-vs-holistic-evaluation-developer-rct.md | 36 ++++++++++++++++++++-
 4 files changed, 80 insertions(+), 1 deletion(-)
 create mode 100644 inbox/queue/.extraction-debug/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.json

diff --git a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
index d8fdd3275..5fbdfbb77 100644
--- a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
+++ b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
@@ -104,6 +104,12 @@ IAISR 2026 states that 'pre-deployment testing increasingly fails to predict rea
 Anthropic's explicit admission that 'the science of model evaluation isn't well-developed enough to provide definitive threshold assessments' is direct confirmation from a frontier lab that evaluation tools are insufficient for governance. This aligns with METR's March 2026 modeling assumptions note, suggesting field-wide consensus that current evaluation science cannot support the governance structures built on top of it.
 
+### Additional Evidence (confirm)
+*Source: [[2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct]] | Added: 2026-03-24*
+
+METR's finding that algorithmic scoring (a 38% pass rate) completely failed to predict production readiness (0% mergeable) is direct empirical confirmation that pre-deployment evaluations built on automated metrics do not predict real-world performance. The 42-minute average time to bring a passing agent PR to mergeable quality quantifies the gap between evaluation and deployment reality.
+
+
diff --git a/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md b/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md
index 712d767a3..b0c1fd533 100644
--- a/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md
+++ b/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md
@@ -40,6 +40,12 @@ The International AI Safety Report 2026 (multi-government committee, February 20
 METR's time horizon metric measures task difficulty by human completion time, not model processing time. A model with a 5-hour time horizon completes tasks that take humans 5 hours, but may finish them in minutes.
 This speed asymmetry is not captured in the metric itself, meaning the gap between theoretical capability (task completion) and deployment impact includes both adoption lag AND the unmeasured throughput advantage that organizations fail to utilize.
 
+### Additional Evidence (challenge)
+*Source: [[2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct]] | Added: 2026-03-24*
+
+METR's developer productivity RCT found that experienced developers were 19% SLOWER with AI tools despite full adoption and tool access. This challenges the assumption that adoption lag is the primary bottleneck: even when skilled users fully adopt the tools, productivity can decline rather than improve. The gap may reflect not adoption lag but a fundamental capability-deployment mismatch.
+
+
 Relevant Notes:
 - [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]] — capability exists but deployment is uneven
diff --git a/inbox/queue/.extraction-debug/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.json b/inbox/queue/.extraction-debug/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.json
new file mode 100644
index 000000000..8b27310af
--- /dev/null
+++ b/inbox/queue/.extraction-debug/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.json
@@ -0,0 +1,33 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "benchmark-based-capability-metrics-overstate-autonomous-performance-through-automated-scoring-gap.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-conditions.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 3,
+    "rejected": 2,
+    "fixes_applied": [
+      "benchmark-based-capability-metrics-overstate-autonomous-performance-through-automated-scoring-gap.md:set_created:2026-03-24",
+      "benchmark-based-capability-metrics-overstate-autonomous-performance-through-automated-scoring-gap.md:stripped_wiki_link:verification degrades faster than capability grows",
+      "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-conditions.md:set_created:2026-03-24"
+    ],
+    "rejections": [
+      "benchmark-based-capability-metrics-overstate-autonomous-performance-through-automated-scoring-gap.md:missing_attribution_extractor",
+      "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-conditions.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-24"
+}
\ No newline at end of file
diff --git a/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md b/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
index 1812b87e1..4b3b22e1d 100644
--- a/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
+++ b/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
@@ -7,9 +7,13 @@ date: 2025-08-12
 domain: ai-alignment
 secondary_domains: []
 format: research-report
-status: unprocessed
+status: enrichment
 priority: high
 tags: [metr, developer-productivity, benchmark-inflation, capability-measurement, rct, holistic-evaluation, algorithmic-scoring, real-world-performance]
+processed_by: theseus
+processed_date: 2026-03-24
+enrichments_applied: ["the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -68,3 +72,33 @@ PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
 WHY ARCHIVED: This is the strongest empirical evidence found in 13 sessions that benchmark-based capability metrics systematically overstate real-world autonomous capability. The RCT design (not observational), precise quantification (0% production-ready, 19% slowdown), and the source (METR — the primary capability evaluator) make this a high-quality disconfirmation signal for B1 urgency.
 
 EXTRACTION HINT: The extractor should develop the "benchmark-reality gap" as a potential new claim or divergence against existing time-horizon-based capability claims. The key question is whether this gap is stable, growing, or shrinking over model generations — if frontier models also show the gap, this updates the urgency of the entire six-layer governance arc.
+
+
+## Key Facts
+- Claude 3.7 Sonnet achieved a 38% success rate under automated test scoring in METR's evaluation
+- 0% of Claude 3.7 Sonnet's test-passing code was production-ready according to human expert review
+- 100% of passing-test agent PRs had testing coverage deficiencies
+- 75% of passing-test agent PRs had documentation gaps
+- 75% of passing-test agent PRs had linting/formatting problems
+- 25% of passing-test agent PRs had residual functionality gaps
+- Average time to fix agent PRs to production-ready quality: 42 minutes
+- Original human task time averaged 1.3 hours
+- Experienced developers using AI tools took 19% longer on tasks than without AI in the METR RCT
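+
+A minimal sketch of the ratios these figures imply (numbers taken from the list above; variable names are illustrative, not from the METR report):
+
+```python
+# Derived quantities from the METR figures listed above.
+pass_rate = 0.38          # automated-test pass rate (Claude 3.7 Sonnet)
+mergeable_rate = 0.0      # fraction of passing code judged production-ready
+fix_minutes = 42          # mean time to bring a passing agent PR to mergeable
+human_minutes = 1.3 * 60  # mean original human task time (78 minutes)
+
+# Algorithmic-vs-holistic gap: the entire 38-point pass rate evaporates
+# under expert review.
+benchmark_reality_gap = pass_rate - mergeable_rate   # 0.38
+
+# Fixing overhead relative to doing the task by hand: roughly 54% of the
+# original human task time goes to repairing "passing" agent output.
+fix_overhead = fix_minutes / human_minutes           # ~0.54
+```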