---
type: source
title: "METR Research Update: Algorithmic Scoring Overstates AI Capability by 2-3x Versus Holistic Human Review"
author: "METR (@METR_evals)"
url: https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/
date: 2025-08-12
domain: ai-alignment
secondary_domains: []
format: blog
status: processed
priority: high
tags: [METR, HCAST, algorithmic-scoring, holistic-evaluation, benchmark-reality-gap, SWE-bench, governance-thresholds, capability-measurement]
---
## Content

METR's August 2025 research update ("Towards Reconciling Slowdown with Time Horizons") identifies a large and systematic gap between algorithmic (automated) scoring and holistic (human review) scoring of AI software tasks.

Key findings:
- Claude 3.7 Sonnet scored **38% success** on software tasks under algorithmic scoring
- Under holistic human review of the same runs: **0% fully mergeable**
- Most common failure modes in algorithmically-"passing" runs: testing coverage gaps (91%), documentation deficiencies (89%), linting/formatting issues (73%), code quality problems (64%); a rough consistency check on these figures follows the list
- Even when passing all human-written test cases, estimated human remediation time averaged **26 minutes**, approximately one-third of the original task duration

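
A quick plausibility check on the 38%-versus-0% gap (my sketch, not METR's analysis; it assumes the four failure modes occur independently, which is unlikely to hold exactly): even under that generous assumption, essentially no algorithmically-passing run would clear all four holistic criteria.

```python
# Back-of-the-envelope check. The independence assumption below is mine,
# not something METR claims; the rates are the ones reported above.
failure_rates = {
    "testing coverage gaps": 0.91,
    "documentation deficiencies": 0.89,
    "linting/formatting issues": 0.73,
    "code quality problems": 0.64,
}

# Probability that an algorithmically-passing run avoids every failure mode.
p_clean = 1.0
for rate in failure_rates.values():
    p_clean *= 1.0 - rate

algorithmic_pass_rate = 0.38  # Claude 3.7 Sonnet under algorithmic scoring

print(f"P(clears all four criteria | algorithmic pass): {p_clean:.4f}")  # ~0.001
print(f"Implied fully-mergeable rate over all runs: {algorithmic_pass_rate * p_clean:.2%}")  # ~0.04%
```

That is consistent with the observed 0% fully-mergeable result, though the real explanation is simply that holistic review checks far more than the automated tests do.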
Context on SWE-Bench: METR explicitly states that "frontier model success rates on SWE-Bench Verified are around 70-75%, but it seems unlikely that AI agents are currently *actually* able to fully resolve 75% of real PRs in the wild." Root cause: "algorithmic scoring used by many benchmarks may overestimate AI agent real-world performance" because algorithms measure "core implementation" only, missing documentation, testing, code quality, and project standard compliance.

Governance implications: Time horizon benchmarks using algorithmic scoring drive METR's safety threshold recommendations. METR acknowledges the 131-day doubling time (from prior reports) is derived from benchmark performance that may "substantially overestimate" real-world capability. METR's own response: incorporate holistic assessment elements into formal evaluations (assurance checklists, reasoning trace analysis, situational awareness testing).

HCAST v1.1 update (January 2026): Task suite expanded from 170 to 228 tasks. Time horizon estimates shifted dramatically between versions — GPT-4 1106 dropped 57%, GPT-5 rose 55% — indicating benchmark instability of ~50% between annual versions.

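
To put that instability in time-horizon terms, here is a small conversion of my own (not METR's): reading a between-version shift as apparent capability progress at the 131-day doubling time cited above.

```python
# Convert a between-version shift in estimated time horizon into equivalent
# days of "apparent progress" at METR's reported 131-day doubling time.
# This framing is mine, not something METR publishes.
import math

DOUBLING_TIME_DAYS = 131

def shift_in_days(new_over_old: float) -> float:
    """Days of progress implied by a time-horizon ratio of new/old."""
    return math.log2(new_over_old) * DOUBLING_TIME_DAYS

print(f"GPT-4 1106 (-57%): {shift_in_days(0.43):+.0f} days")  # about -160 days
print(f"GPT-5 (+55%):      {shift_in_days(1.55):+.0f} days")  # about +83 days
```

A benchmark revision can therefore look like roughly three to five months of capability change in either direction, which is the discontinuity problem for threshold-based governance.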
METR's current formal thresholds for "catastrophic risk" scrutiny:
- 80% time horizon exceeding **8 hours** on high-context tasks
- 50% time horizon exceeding **40 hours** on software engineering/ML tasks
- GPT-5's 50% time horizon (January 2026): **2 hours 17 minutes**, far below the 40-hour threshold (a naive extrapolation of this gap follows below)

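
For scale, a naive extrapolation (mine, not METR's; it takes both the benchmark numbers and the 131-day doubling time at face value): the gap between GPT-5's measured horizon and the 40-hour threshold is about four doublings.

```python
# Naive distance-to-threshold estimate. Assumes the 131-day doubling time holds
# and that the benchmark numbers are trustworthy -- both contested above.
import math

current_horizon_hours = 2 + 17 / 60   # GPT-5 50% time horizon: 2h17m
threshold_hours = 40                  # METR 50% time-horizon threshold
doubling_time_days = 131

doublings = math.log2(threshold_hours / current_horizon_hours)
days = doublings * doubling_time_days

print(f"Doublings to threshold: {doublings:.1f}")  # ~4.1
print(f"Naive calendar gap: {days:.0f} days (~{days / 365:.1f} years)")  # ~541 days, ~1.5 years
```

If algorithmic scoring overstates real-world capability, as the findings above suggest, the effective gap is larger still.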
## Agent Notes
**Why this matters:** METR is the organization whose evaluations ground formal capability thresholds for multiple lab safety frameworks (including Anthropic's RSP). If their measurement methodology systematically overstates capability by 2-3x, then governance thresholds derived from METR assessments may trigger too early (for overall software tasks) or too late (for specific dangerous capabilities that diverge from general software benchmarks). The 50%+ shift between HCAST versions is itself a governance discontinuity problem.

**What surprised me:** METR acknowledging the problem openly and explicitly. Also surprising: GPT-5's January 2026 evaluation puts its 50% time horizon at 2h17m, far below the 40-hour threshold for "catastrophic risk." This is a much more measured assessment of current frontier capability than benchmark headlines suggest.

**What I expected but didn't find:** A proposed replacement methodology. METR is incorporating holistic elements but hasn't proposed a formal replacement for algorithmic time-horizon metrics as governance triggers.

**KB connections:**
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — the evaluation methodology finding extends this: the degradation isn't just about debate protocols; it's about the entire measurement architecture
- [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]] — capability ≠ reliable self-evaluation; extends to capability ≠ reliable external evaluation too

**Extraction hints:** Two strong claim candidates: (1) METR's algorithmic-vs-holistic finding as a specific, empirically grounded instance of the benchmark-reality gap — stronger and more specific than session 13/14's general claims; (2) HCAST version instability as a distinct governance discontinuity problem — even if you trust the benchmark methodology, ~50% shifts between versions make governance thresholds a moving target.

**Context:** METR (Model Evaluation and Threat Research) is one of the leading independent AI safety evaluation organizations. Its evaluations are used by Anthropic, OpenAI, and others for capability threshold assessments. Founded by former OpenAI safety researchers including Beth Barnes.

## Curator Notes
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]

WHY ARCHIVED: Empirical validation that the *measurement infrastructure* for AI governance is systematically unreliable — extends session 13/14's benchmark-reality gap finding with specific numbers and the source organization explicitly acknowledging the problem

EXTRACTION HINT: Focus on the governance implication: METR's own evaluations, which are used to set safety thresholds, may overstate real-world capability by 2-3x in software domains — and the benchmark is unstable enough to shift 50%+ between annual versions