From 3e302edbf4134f2caf3e664e423a87b6c9b5a4ea Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Wed, 25 Mar 2026 11:30:45 +0000 Subject: [PATCH] extract: 2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70> --- ...ogram execution during the same session.md | 6 +++++ ...ernance-built-on-unreliable-foundations.md | 6 +++++ ...ity limits determines real-world impact.md | 6 +++++ ...listic-evaluation-benchmark-inflation.json | 24 +++++++++++++++++++ ...holistic-evaluation-benchmark-inflation.md | 18 +++++++++++++- 5 files changed, 59 insertions(+), 1 deletion(-) create mode 100644 inbox/queue/.extraction-debug/2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation.json diff --git a/domains/ai-alignment/AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md b/domains/ai-alignment/AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md index ac557b2b..29882fc9 100644 --- a/domains/ai-alignment/AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md +++ b/domains/ai-alignment/AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md @@ -26,6 +26,12 @@ The finding also strengthens the case for [[safe AI development requires buildin --- +### Additional Evidence (extend) +*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25* + +METR's holistic evaluation provides systematic evidence for capability-reliability divergence at the benchmark architecture level. Models achieving 70-75% on algorithmic tests produce 0% production-ready output, with 100% of 'passing' solutions missing adequate testing and 75% missing proper documentation. This is not session-to-session variance but systematic architectural failure where optimization for algorithmically verifiable rewards creates a structural gap between measured capability and operational reliability. 
+ + Relevant Notes: - [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]] — distinct failure mode: unintentional unreliability vs intentional deception - [[safe AI development requires building alignment mechanisms before scaling capability]] — capability outrunning reliability strengthens the sequencing argument diff --git a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md index 7f641c40..274c07fb 100644 --- a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md +++ b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md @@ -124,6 +124,12 @@ METR's scaffold sensitivity finding (GPT-4o and o3 performing better under Vivar METR's methodology (RCT + 143 hours of screen recordings at ~10-second resolution) represents the most rigorous empirical design deployed for AI productivity research. The combination of randomized assignment, real tasks developers would normally work on, and granular behavioral decomposition sets a new standard for evaluation quality. This contrasts sharply with pre-deployment evaluations that lack real-world task context. +### Additional Evidence (confirm) +*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25* + +METR, the primary producer of governance-relevant capability benchmarks, explicitly acknowledges their own time horizon metric (which uses algorithmic scoring) likely overstates operational autonomous capability. The 131-day doubling time for dangerous autonomy may reflect benchmark performance growth rather than real-world capability growth, as the same algorithmic scoring approach that produces 70-75% SWE-Bench success yields 0% production-ready output under holistic evaluation. + + diff --git a/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md b/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md index 712d767a..d2396fad 100644 --- a/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md +++ b/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md @@ -40,6 +40,12 @@ The International AI Safety Report 2026 (multi-government committee, February 20 METR's time horizon metric measures task difficulty by human completion time, not model processing time. A model with a 5-hour time horizon completes tasks that take humans 5 hours, but may finish them in minutes. 
This speed asymmetry is not captured in the metric itself, meaning the gap between theoretical capability (task completion) and deployment impact includes both adoption lag AND the unmeasured throughput advantage that organizations fail to utilize. +### Additional Evidence (extend) +*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25* + +METR quantifies a specific mechanism for the capability-deployment gap in software engineering: 26 minutes of additional human work per 'passing' task (one-third of total task time) is required to make algorithmically-successful AI output production-ready. This is not adoption lag but architectural mismatch—benchmarks measure core implementation while deployment requires documentation, testing, and code quality that current evaluation frameworks systematically omit. + + Relevant Notes: - [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]] — capability exists but deployment is uneven diff --git a/inbox/queue/.extraction-debug/2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation.json b/inbox/queue/.extraction-debug/2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation.json new file mode 100644 index 00000000..16e5db56 --- /dev/null +++ b/inbox/queue/.extraction-debug/2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation.json @@ -0,0 +1,24 @@ +{ + "rejected_claims": [ + { + "filename": "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md", + "issues": [ + "missing_attribution_extractor" + ] + } + ], + "validation_stats": { + "total": 1, + "kept": 0, + "fixed": 1, + "rejected": 1, + "fixes_applied": [ + "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md:set_created:2026-03-25" + ], + "rejections": [ + "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md:missing_attribution_extractor" + ] + }, + "model": "anthropic/claude-sonnet-4.5", + "date": "2026-03-25" +} \ No newline at end of file diff --git a/inbox/queue/2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation.md b/inbox/queue/2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation.md index b15335d0..78367344 100644 --- a/inbox/queue/2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation.md +++ b/inbox/queue/2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation.md @@ -7,9 +7,13 @@ date: 2025-08-12 domain: ai-alignment secondary_domains: [] format: blog-post -status: unprocessed +status: enrichment priority: high tags: [benchmark-inflation, holistic-evaluation, swe-bench, time-horizon, production-readiness, algorithmic-scoring] +processed_by: theseus +processed_date: 2026-03-25 +enrichments_applied: ["AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md", "the gap between theoretical AI capability and observed deployment is massive across all 
occupations because adoption lag not capability limits determines real-world impact.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"] +extraction_model: "anthropic/claude-sonnet-4.5" --- ## Content @@ -55,3 +59,15 @@ METR's research update that directly reconciles the apparent contradiction betwe PRIMARY CONNECTION: [[AI capability and reliability are independent dimensions]] — extends this from session behavior to systematic benchmark architecture failure WHY ARCHIVED: Provides METR's explicit acknowledgment of benchmark inflation for their own governance-relevant metric; closes the loop on the session 13 disconfirmation thread EXTRACTION HINT: Focus on (1) the specific quantitative gap (70-75% → 0%), (2) METR's explicit statement about what time horizon benchmarks miss, (3) the five failure mode taxonomy. Don't extract the developer productivity slowdown separately — that's the parent study; this is the theoretical reconciliation. + + +## Key Facts +- METR's holistic evaluation study examined 18 real repository tasks averaging 1.3 hours each +- Frontier models achieve 70-75% success on SWE-Bench Verified under algorithmic scoring +- Under holistic evaluation, 0% of passing PRs were fully mergeable without substantial revision +- Models achieved 38% algorithmic success rate on METR's test set (similar to ~50% HCAST benchmark) +- 100% of algorithmically-passing PRs had inadequate testing coverage +- 75% of algorithmically-passing PRs had missing/incorrect documentation +- 75% of algorithmically-passing PRs had linting/formatting/typing issues +- Average of 26 minutes additional human work required per 'passing' PR +- METR's time horizon benchmark shows 131-day capability doubling time
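+
+The figures above can be cross-checked against each other. A minimal sketch follows (all values are taken from the Key Facts; the variable names and the comparison itself are illustrative, not part of METR's published analysis):
+
```python
# Cross-check the Key Facts figures against each other.

avg_task_hours = 1.3      # average length of METR's 18 real repository tasks
rework_minutes = 26       # extra human work per algorithmically "passing" PR

avg_task_minutes = avg_task_hours * 60              # 78 minutes
rework_fraction = rework_minutes / avg_task_minutes
print(f"Rework share of an average task: {rework_fraction:.0%}")  # ~33%, i.e. one-third

# Same outputs, two scoring regimes (SWE-Bench Verified reports 70-75% under algorithmic scoring)
metr_algorithmic_pass = 0.38   # METR's 18-task set, algorithmic scoring
fully_mergeable = 0.0          # METR's 18-task set, holistic scoring
print(f"Algorithmic pass on METR tasks: {metr_algorithmic_pass:.0%}, "
      f"fully mergeable without revision: {fully_mergeable:.0%}")
```
+
+Under these stated figures, the 26-minute rework average matches the "one-third of total task time" characterization, and the algorithmic-versus-holistic spread is visible directly in the two pass rates.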