extract: 2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
parent 9b81b6f723
commit 3e302edbf4
5 changed files with 59 additions and 1 deletion
@@ -26,6 +26,12 @@ The finding also strengthens the case for [[safe AI development requires building alignment mechanisms before scaling capability]]

---

### Additional Evidence (extend)

*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*

METR's holistic evaluation provides systematic evidence for capability-reliability divergence at the level of benchmark architecture. Models achieving 70-75% on algorithmic tests produce 0% production-ready output: 100% of 'passing' solutions lack adequate testing and 75% lack proper documentation. This is not session-to-session variance but a systematic architectural failure, in which optimization for algorithmically verifiable rewards creates a structural gap between measured capability and operational reliability.

Relevant Notes:

- [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]] — distinct failure mode: unintentional unreliability vs intentional deception
- [[safe AI development requires building alignment mechanisms before scaling capability]] — capability outrunning reliability strengthens the sequencing argument

@@ -124,6 +124,12 @@ METR's scaffold sensitivity finding (GPT-4o and o3 performing better under Vivar

METR's methodology (an RCT plus 143 hours of screen recordings at ~10-second resolution) represents the most rigorous empirical design yet deployed for AI productivity research. The combination of randomized assignment, real tasks developers would normally work on, and granular behavioral decomposition sets a new standard for evaluation quality, and contrasts sharply with pre-deployment evaluations that lack real-world task context.

### Additional Evidence (confirm)

*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*

METR, the primary producer of governance-relevant capability benchmarks, explicitly acknowledges that its own time horizon metric (which uses algorithmic scoring) likely overstates operational autonomous capability. The 131-day doubling time for dangerous autonomy may reflect benchmark performance growth rather than real-world capability growth, since the same algorithmic scoring approach that produces 70-75% SWE-Bench success yields 0% production-ready output under holistic evaluation.

@@ -40,6 +40,12 @@ The International AI Safety Report 2026 (multi-government committee, February 20

METR's time horizon metric measures task difficulty by human completion time, not model processing time. A model with a 5-hour time horizon completes tasks that take humans 5 hours, but may finish them in minutes. This speed asymmetry is not captured in the metric itself, so the gap between theoretical capability (task completion) and deployment impact includes both adoption lag and the unmeasured throughput advantage that organizations fail to utilize.
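
A toy calculation makes the asymmetry concrete (the 5-hour and 10-minute figures below are hypothetical, echoing the example in the paragraph above rather than METR data):

```python
# Toy sketch of the speed asymmetry in time-horizon scoring.
# Numbers are hypothetical illustrations, not METR measurements.

human_minutes = 5 * 60   # the task takes a human 5 hours
model_minutes = 10       # the model finishes the same task in 10 minutes

# The metric credits human completion time, so a fast and a slow
# model that both complete the task register the same 5-hour horizon.
credited_horizon_minutes = human_minutes

# The throughput advantage is real but invisible to the metric.
speedup = human_minutes / model_minutes

print(f"credited horizon: {credited_horizon_minutes} min")
print(f"unmeasured throughput advantage: {speedup:.0f}x")
```

Under these assumptions the metric reports the same number whether the model takes 10 minutes or 10 hours, which is exactly the unmeasured throughput the note describes.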

### Additional Evidence (extend)

*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*

METR quantifies a specific mechanism behind the capability-deployment gap in software engineering: 26 minutes of additional human work per 'passing' task (one-third of total task time) is required to make algorithmically-successful AI output production-ready. This is not adoption lag but architectural mismatch—benchmarks measure core implementation while deployment requires the documentation, testing, and code quality that current evaluation frameworks systematically omit.
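
The stated figures can be cross-checked: if 26 minutes is one-third of total task time, the implied average task length is about 1.3 hours, which lines up with the study's reported task average (a rough consistency check, not a calculation METR reports):

```python
# Consistency check on the figures quoted above.
cleanup_minutes = 26        # extra human work per "passing" task
cleanup_fraction = 1 / 3    # described as one-third of total task time

implied_total_minutes = cleanup_minutes / cleanup_fraction
implied_total_hours = implied_total_minutes / 60

# 26 / (1/3) = 78 minutes, i.e. about 1.3 hours, matching the
# ~1.3-hour average task length reported for the study.
print(f"implied average task length: {implied_total_hours:.1f} h")
```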

Relevant Notes:

- [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]] — capability exists but deployment is uneven

@@ -0,0 +1,24 @@

{
  "rejected_claims": [
    {
      "filename": "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 1,
    "kept": 0,
    "fixed": 1,
    "rejected": 1,
    "fixes_applied": [
      "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md:set_created:2026-03-25"
    ],
    "rejections": [
      "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-25"
}

@@ -7,9 +7,13 @@ date: 2025-08-12
 domain: ai-alignment
 secondary_domains: []
 format: blog-post
-status: unprocessed
+status: enrichment
 priority: high
 tags: [benchmark-inflation, holistic-evaluation, swe-bench, time-horizon, production-readiness, algorithmic-scoring]
+processed_by: theseus
+processed_date: 2026-03-25
+enrichments_applied: ["AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md", "the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content

@@ -55,3 +59,15 @@ METR's research update that directly reconciles the apparent contradiction betwe

PRIMARY CONNECTION: [[AI capability and reliability are independent dimensions]] — extends this from session behavior to systematic benchmark architecture failure

WHY ARCHIVED: Provides METR's explicit acknowledgment of benchmark inflation for their own governance-relevant metric; closes the loop on the session 13 disconfirmation thread

EXTRACTION HINT: Focus on (1) the specific quantitative gap (70-75% → 0%), (2) METR's explicit statement about what time horizon benchmarks miss, (3) the five failure mode taxonomy. Don't extract the developer productivity slowdown separately — that's the parent study; this is the theoretical reconciliation.

## Key Facts

- METR's holistic evaluation study examined 18 real repository tasks averaging 1.3 hours each
- Frontier models achieve 70-75% success on SWE-Bench Verified under algorithmic scoring
- Under holistic evaluation, 0% of passing PRs were fully mergeable without substantial revision
- Models achieved a 38% algorithmic success rate on METR's test set (similar to the ~50% HCAST benchmark figure)
- 100% of algorithmically-passing PRs had inadequate testing coverage
- 75% of algorithmically-passing PRs had missing or incorrect documentation
- 75% of algorithmically-passing PRs had linting, formatting, or typing issues
- An average of 26 minutes of additional human work was required per 'passing' PR
- METR's time horizon benchmark shows a 131-day capability doubling time
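
Taken at face value, a 131-day doubling time compounds quickly; a small sketch of the implied growth (plain arithmetic on the figure above, not a METR projection, and the note's caveat is that this growth may partly reflect benchmark inflation):

```python
# Implied compounding from the 131-day doubling time quoted above.
# Plain arithmetic illustration, not a METR projection.

doubling_days = 131

# Multiplier on the measured time horizon after one year.
growth_per_year = 2 ** (365 / doubling_days)

print(f"implied time-horizon growth per year: {growth_per_year:.1f}x")
```

At this rate the measured horizon would grow roughly 6.9x per year, which is why the question of whether the metric tracks real capability or benchmark performance matters so much for governance.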