| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | METR: Algorithmic vs. Holistic Evaluation — AI Made Experienced Developers 19% Slower, 0% Production-Ready | METR (Model Evaluation and Threat Research) | https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/ | 2025-08-12 | ai-alignment | | research-report | unprocessed | high | |
Content
METR research update reconciling two results in tension: the finding that experienced open-source developers using AI tools took 19% longer on tasks, and the time horizon results showing rapid capability growth.
The developer productivity finding:
- Design: randomized controlled trial (RCT) of experienced open-source developers completing tasks with and without AI tools
- Result: tasks took 19% longer with AI assistance than without
- The result was unexpected: before the study, developers predicted significant speed-ups
The holistic evaluation finding:
- 18 open-source software tasks evaluated both algorithmically (test pass/fail) and holistically (human expert review); a minimal sketch of the two scoring modes follows this list
- Claude 3.7 Sonnet: 38% success rate under automated test scoring
- 0% production-ready: "none of them are mergeable as-is" after human expert review
- Failure categories in "passing" agent PRs:
  - Testing coverage deficiencies: 100% of passing-test runs
  - Documentation gaps: 75% of passing-test runs
  - Linting/formatting problems: 75% of passing-test runs
  - Residual functionality gaps: 25% of passing-test runs
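To make the scoring gap concrete, here is a minimal sketch of the two scoring modes (all names and fields are hypothetical illustrations, not METR's actual evaluation harness):

```python
from dataclasses import dataclass

@dataclass
class AgentPR:
    """Hypothetical record of an agent-authored pull request."""
    tests_pass: bool             # algorithmic signal: does the test suite pass?
    adds_test_coverage: bool     # holistic: is new behavior covered by new tests?
    documented: bool             # holistic: are docs/comments updated?
    lint_clean: bool             # holistic: does it pass linting/formatting?
    functionally_complete: bool  # holistic: no residual functionality gaps?

def algorithmic_score(pr: AgentPR) -> bool:
    """Benchmark-style scoring: test pass/fail only."""
    return pr.tests_pass

def holistic_score(pr: AgentPR) -> bool:
    """Human-review-style scoring: mergeable as-is only if every
    non-verifiable objective is also met."""
    return all([pr.tests_pass, pr.adds_test_coverage, pr.documented,
                pr.lint_clean, pr.functionally_complete])

# A PR matching the modal failure profile above: tests pass, but
# coverage, documentation, and linting are all deficient.
pr = AgentPR(tests_pass=True, adds_test_coverage=False, documented=False,
             lint_clean=False, functionally_complete=True)
print(algorithmic_score(pr))  # True  -> counted as a benchmark success
print(holistic_score(pr))     # False -> not mergeable as-is
```

The spread between these two scores is what the 38%-vs-0% figures above quantify.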
Time required to fix agent PRs to production-ready:
- Average: 42 minutes of additional human work per agent PR
- Context: Original human task time averaged 1.3 hours
- The 42-minute fix time is roughly half of the average original task time (42/78 minutes ≈ 54%); see the quick check below
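A quick check of that ratio, using the figures as stated above:

```python
# Average fix time relative to average original human task time.
original_task_min = 1.3 * 60  # 1.3 hours = 78 minutes
fix_time_min = 42             # average extra human work per agent PR
print(f"{fix_time_min / original_task_min:.0%}")  # -> 54%, roughly half
```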
METR's explanation of the gap: "Algorithmic scoring may overestimate AI agent real-world performance because benchmarks don't capture non-verifiable objectives like documentation quality and code maintainability — work humans must ultimately complete."
"Hill-climbing on algorithmic metrics may end up not yielding corresponding productivity improvements in the wild."
Implication for capability claims: Frontier model benchmark performance claims "significantly overstate practical utility." The disconnect suggests that benchmark-based capability metrics (including time horizon) may reflect a narrow slice of what makes autonomous AI action dangerous or useful in practice.
Agent Notes
Why this matters: This is the most significant disconfirmation signal for B1 urgency found in 13 sessions. If the primary capability metric (time horizon, based on automated task-completion scoring) systematically overstates real-world autonomous capability by this margin, then dangerous autonomous capability may be growing considerably more slowly than the benchmark-derived 131-day doubling time implies (the underlying model is sketched below). The 0% production-ready finding is particularly striking: not a 20% or 50% production-ready rate, but zero.
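For reference, a sketch of the exponential model a fixed doubling time implies. Here H_0 is the time horizon at an arbitrary reference date and T_d the doubling time; the 131-day figure is the one cited above, everything else is a placeholder:

```latex
% Time horizon H(t), measured t days after a reference date with horizon H_0,
% under a fixed doubling time T_d:
H(t) = H_0 \cdot 2^{\,t / T_d}, \qquad T_d = 131\ \text{days}
% If automated scoring overstates real capability, the effective H_0 is
% smaller and/or the effective T_d is longer than the benchmark estimate.
```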
What surprised me: The finding that developers were slower with AI assistance is counterintuitive, and the study design is strong (an RCT, not an observational study). The 42-minute fix-time finding is precise and concrete. The disconnect between developer confidence (predicted speedup) and the actual result (slowdown) mirrors the disconnect between benchmark confidence and actual production readiness.
What I expected but didn't find: Any evidence that the productivity slowdown was domain-specific or driven by task selection artifacts. METR's reconciliation paper treats the 19% slowdown as a real finding that needs explanation, not an artifact to be explained away.
KB connections:
- verification degrades faster than capability grows — if benchmarks overestimate capability by this margin, behavioral verification tools (including benchmarks) may be systematically misleading about the actual capability trajectory
- adoption lag exceeds capability limits as primary bottleneck to AI economic impact — the 19% slowdown in experienced developers is evidence against rapid adoption producing rapid productivity gains even when adoption occurs
- The METR time horizon project itself: if the time horizon metric has the same fundamental measurement problem (automated scoring without holistic evaluation), then all time horizon estimates may be overestimating actual dangerous autonomous capability
Extraction hints: Primary claim candidate: "benchmark-based AI capability metrics overstate real-world autonomous performance because automated scoring doesn't capture documentation, maintainability, or production-readiness requirements — creating a systematic gap between measured and dangerous capability." Secondary claim: "AI tools reduced productivity for experienced developers in controlled RCT conditions despite developer expectations of speedup — suggesting capability deployment may not translate to autonomy even when tools are adopted."
Context: METR published this in August 2025 as a reconciliation piece — acknowledging the tension between the time horizon results (rapid capability growth) and the developer productivity finding (experienced developers slower with AI). The paper is significant because it's the primary capability evaluator acknowledging that its own capability metric may systematically overstate practical autonomy.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: verification degrades faster than capability grows
WHY ARCHIVED: This is the strongest empirical evidence found in 13 sessions that benchmark-based capability metrics systematically overstate real-world autonomous capability. The RCT design (not observational), precise quantification (0% production-ready, 19% slowdown), and the source (METR — the primary capability evaluator) make this a high-quality disconfirmation signal for B1 urgency.
EXTRACTION HINT: The extractor should develop the "benchmark-reality gap" as a potential new claim or divergence against existing time-horizon-based capability claims. The key question is whether this gap is stable, growing, or shrinking over model generations — if frontier models also show the gap, this updates the urgency of the entire six-layer governance arc.