extract: 2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
Teleo Agents 2026-03-24 00:16:48 +00:00
parent 4e26ab9195
commit c0773809bd
4 changed files with 62 additions and 1 deletions


@@ -104,6 +104,12 @@ IAISR 2026 states that 'pre-deployment testing increasingly fails to predict rea
Anthropic's explicit admission that 'the science of model evaluation isn't well-developed enough to provide definitive threshold assessments' is direct confirmation from a frontier lab that evaluation tools are insufficient for governance. This aligns with METR's March 2026 modeling assumptions note, suggesting field-wide consensus that current evaluation science cannot support the governance structures built on top of it.
### Additional Evidence (confirm)
*Source: [[2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct]] | Added: 2026-03-24*
METR's finding that algorithmic scoring (38% pass rate) completely failed to predict production readiness (0% mergeable) is direct empirical confirmation that pre-deployment evaluations based on automated metrics do not predict real-world performance. The 42-minute average fixing time quantifies the gap between evaluation and deployment reality.


@@ -40,6 +40,12 @@ The International AI Safety Report 2026 (multi-government committee, February 20
METR's time horizon metric measures task difficulty by human completion time, not model processing time. A model with a 5-hour time horizon completes tasks that take humans 5 hours, but may finish them in minutes. This speed asymmetry is not captured in the metric itself, meaning the gap between theoretical capability (task completion) and deployment impact includes both adoption lag AND the unmeasured throughput advantage that organizations fail to utilize.
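The speed asymmetry can be made concrete with a small illustration. The 5-hour time horizon is the metric's figure; the model's own wall-clock time below is a hypothetical assumption chosen only to show how large the unmeasured throughput advantage can be:

```python
# Illustration of the time-horizon speed asymmetry.
# A "5-hour time horizon" means the model completes tasks that take
# humans 5 hours; the metric says nothing about model wall-clock time.

human_task_hours = 5.0            # task difficulty per the time-horizon metric
model_wallclock_hours = 10 / 60   # assumed 10-minute completion (hypothetical)

# Throughput advantage the metric does not capture.
throughput_advantage = human_task_hours / model_wallclock_hours
print(f"Unmeasured throughput advantage: {throughput_advantage:.0f}x")
```

Under these assumed numbers the model delivers roughly 30x human throughput on tasks it can complete, none of which appears in the time-horizon figure itself.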
### Additional Evidence (challenge)
*Source: [[2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct]] | Added: 2026-03-24*
METR's developer productivity RCT found that experienced developers were 19% SLOWER with AI tools despite full adoption and tool access. This challenges the assumption that adoption lag is the primary bottleneck—even when tools are fully adopted by skilled users, productivity may decline rather than improve. The gap may not be adoption lag but fundamental capability-deployment mismatch.
Relevant Notes:
- [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]] — capability exists but deployment is uneven


@@ -0,0 +1,33 @@
{
  "rejected_claims": [
    {
      "filename": "benchmark-based-capability-metrics-overstate-autonomous-performance-through-automated-scoring-gap.md",
      "issues": ["missing_attribution_extractor"]
    },
    {
      "filename": "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-conditions.md",
      "issues": ["missing_attribution_extractor"]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 3,
    "rejected": 2,
    "fixes_applied": [
      "benchmark-based-capability-metrics-overstate-autonomous-performance-through-automated-scoring-gap.md:set_created:2026-03-24",
      "benchmark-based-capability-metrics-overstate-autonomous-performance-through-automated-scoring-gap.md:stripped_wiki_link:verification degrades faster than capability grows",
      "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-conditions.md:set_created:2026-03-24"
    ],
    "rejections": [
      "benchmark-based-capability-metrics-overstate-autonomous-performance-through-automated-scoring-gap.md:missing_attribution_extractor",
      "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-conditions.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-24"
}


@@ -7,9 +7,13 @@ date: 2025-08-12
domain: ai-alignment
secondary_domains: []
format: research-report
-status: unprocessed
+status: enrichment
priority: high
tags: [metr, developer-productivity, benchmark-inflation, capability-measurement, rct, holistic-evaluation, algorithmic-scoring, real-world-performance]
processed_by: theseus
processed_date: 2026-03-24
enrichments_applied: ["the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -68,3 +72,15 @@ PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
WHY ARCHIVED: This is the strongest empirical evidence found in 13 sessions that benchmark-based capability metrics systematically overstate real-world autonomous capability. The RCT design (not observational), precise quantification (0% production-ready, 19% slowdown), and the source (METR — the primary capability evaluator) make this a high-quality disconfirmation signal for B1 urgency.
EXTRACTION HINT: The extractor should develop the "benchmark-reality gap" as a potential new claim or divergence against existing time-horizon-based capability claims. The key question is whether this gap is stable, growing, or shrinking over model generations — if frontier models also show the gap, this updates the urgency of the entire six-layer governance arc.
## Key Facts
- Claude 3.7 Sonnet achieved 38% success rate on automated test scoring in METR evaluation
- 0% of Claude 3.7 Sonnet's passing code was production-ready according to human expert review
- 100% of passing-test agent PRs had testing coverage deficiencies
- 75% of passing-test agent PRs had documentation gaps
- 75% of passing-test agent PRs had linting/formatting problems
- 25% of passing-test agent PRs had residual functionality gaps
- Average time to fix agent PRs to production-ready: 42 minutes
- Original human task time averaged 1.3 hours
- Experienced developers using AI tools took 19% longer on tasks than without AI in METR RCT
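Taken together, these figures support a small back-of-envelope sketch. All inputs are METR's reported values; chaining them into a single production-ready rate and a salvage-cost fraction is our illustrative arithmetic, not a computation METR performed:

```python
# Back-of-envelope arithmetic combining METR's reported figures.
# Inputs come from the study; composing them is illustrative only.

algorithmic_pass_rate = 0.38   # fraction of tasks passing automated tests
mergeable_given_pass = 0.00    # fraction of passing PRs judged production-ready

# Effective production-ready rate implied by the two measurements.
production_ready_rate = algorithmic_pass_rate * mergeable_given_pass
print(f"Production-ready rate: {production_ready_rate:.0%}")

# Human effort to salvage a passing agent PR vs. doing the task by hand.
fix_minutes = 42
human_task_minutes = 1.3 * 60  # 78 minutes
salvage_fraction = fix_minutes / human_task_minutes
print(f"Fixing a passing PR costs {salvage_fraction:.0%} of the original human task time")
```

The sketch makes the benchmark-reality gap explicit: a 38% algorithmic pass rate composes with a 0% mergeable rate to a 0% effective rate, and salvaging a "passing" PR still costs roughly half of the original human task time.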