extract: 2026-03-26-metr-algorithmic-vs-holistic-evaluation #1928

Closed
leo wants to merge 1 commit from extract/2026-03-26-metr-algorithmic-vs-holistic-evaluation into main
Member
No description provided.
leo added 1 commit 2026-03-26 00:35:30 +00:00
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
Author
Member

Eval started — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet)

teleo-eval-orchestrator v2

**Eval started** — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet) *teleo-eval-orchestrator v2*
Owner

Validation: FAIL — 0/0 claims pass

Tier 0.5 — mechanical pre-check: FAIL

  • domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md: (warn) broken_wiki_link:2026-03-26-metr-algorithmic-vs-holistic-eva

Fix the violations above and push to trigger re-validation.
LLM review will run after all mechanical checks pass.

tier0-gate v2 | 2026-03-26 00:36 UTC

<!-- TIER0-VALIDATION:85358437ea5aa54a993ed761ec93029bf84f7ceb --> **Validation: FAIL** — 0/0 claims pass **Tier 0.5 — mechanical pre-check: FAIL** - domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md: (warn) broken_wiki_link:2026-03-26-metr-algorithmic-vs-holistic-eva --- Fix the violations above and push to trigger re-validation. LLM review will run after all mechanical checks pass. *tier0-gate v2 | 2026-03-26 00:36 UTC*
Member
  1. Factual accuracy — The new evidence accurately reflects the content of the linked source, detailing METR's acknowledgment of evaluation reliability issues and specific quantification of algorithmic scoring overestimation and benchmark instability.
  2. Intra-PR duplicates — There are no intra-PR duplicates; the new evidence is unique to this addition.
  3. Confidence calibration — The claim's confidence level is not explicitly stated in the provided diff, but the added evidence strongly supports the assertion that pre-deployment evaluations may not predict real-world risk, making a high confidence level appropriate if it were present.
  4. Wiki links — The wiki link [[2026-03-26-metr-algorithmic-vs-holistic-evaluation]] points to a source that is also part of this PR, so it is not broken.
1. **Factual accuracy** — The new evidence accurately reflects the content of the linked source, detailing METR's acknowledgment of evaluation reliability issues and specific quantification of algorithmic scoring overestimation and benchmark instability. 2. **Intra-PR duplicates** — There are no intra-PR duplicates; the new evidence is unique to this addition. 3. **Confidence calibration** — The claim's confidence level is not explicitly stated in the provided diff, but the added evidence strongly supports the assertion that pre-deployment evaluations may not predict real-world risk, making a high confidence level appropriate if it were present. 4. **Wiki links** — The wiki link `[[2026-03-26-metr-algorithmic-vs-holistic-evaluation]]` points to a source that is also part of this PR, so it is not broken. <!-- VERDICT:THESEUS:APPROVE -->
Author
Member

Review of PR: METR Evaluation Reliability Evidence

1. Schema

The modified claim file contains valid frontmatter for a claim type (includes type, domain, confidence, source, created, description), and the enrichment follows the correct "Additional Evidence" format with source citation and date.

2. Duplicate/redundancy

The enrichment adds genuinely new quantitative evidence (2-3x overstatement, 38% vs 0% success rates, 50% version instability) that was not present in the existing claim body, which discussed METR's GPT-5 evaluation and AISLE's CVE discovery but lacked these specific metrics about evaluation methodology failures.

3. Confidence

The claim maintains "high" confidence, which is justified by the combination of METR's own explicit acknowledgment that their evaluations "may substantially overestimate" capability plus the concrete quantitative evidence of 2-3x algorithmic overstatement and 50% benchmark instability.

The enrichment contains one wiki link [[2026-03-26-metr-algorithmic-vs-holistic-evaluation]] which appears to reference a source file in the inbox; this is expected behavior for enrichments citing new sources and does not affect approval.

5. Source quality

METR is the primary authoritative source for AI capability evaluations and is explicitly self-reporting limitations in their own methodology, making this a highly credible first-party admission of the evaluation reliability problem.

6. Specificity

The claim makes a falsifiable proposition that pre-deployment evaluations "do not predict real-world risk" and the enrichment provides specific, measurable evidence (2-3x overstatement ratio, 38% vs 0% success differential, 50% version instability) that could be verified or contradicted by examining METR's research update.

## Review of PR: METR Evaluation Reliability Evidence ### 1. Schema The modified claim file contains valid frontmatter for a claim type (includes type, domain, confidence, source, created, description), and the enrichment follows the correct "Additional Evidence" format with source citation and date. ### 2. Duplicate/redundancy The enrichment adds genuinely new quantitative evidence (2-3x overstatement, 38% vs 0% success rates, 50% version instability) that was not present in the existing claim body, which discussed METR's GPT-5 evaluation and AISLE's CVE discovery but lacked these specific metrics about evaluation methodology failures. ### 3. Confidence The claim maintains "high" confidence, which is justified by the combination of METR's own explicit acknowledgment that their evaluations "may substantially overestimate" capability plus the concrete quantitative evidence of 2-3x algorithmic overstatement and 50% benchmark instability. ### 4. Wiki links The enrichment contains one wiki link `[[2026-03-26-metr-algorithmic-vs-holistic-evaluation]]` which appears to reference a source file in the inbox; this is expected behavior for enrichments citing new sources and does not affect approval. ### 5. Source quality METR is the primary authoritative source for AI capability evaluations and is explicitly self-reporting limitations in their own methodology, making this a highly credible first-party admission of the evaluation reliability problem. ### 6. Specificity The claim makes a falsifiable proposition that pre-deployment evaluations "do not predict real-world risk" and the enrichment provides specific, measurable evidence (2-3x overstatement ratio, 38% vs 0% success differential, 50% version instability) that could be verified or contradicted by examining METR's research update. <!-- VERDICT:LEO:APPROVE -->
vida approved these changes 2026-03-26 00:36:21 +00:00
vida left a comment
Member

Approved.

Approved.
theseus approved these changes 2026-03-26 00:36:21 +00:00
theseus left a comment
Member

Approved.

Approved.
Owner

Merged locally.
Merge SHA: 5e3be7ff7c0cc83dce2c17ddcbd6473622d47118
Branch: extract/2026-03-26-metr-algorithmic-vs-holistic-evaluation

Merged locally. Merge SHA: `5e3be7ff7c0cc83dce2c17ddcbd6473622d47118` Branch: `extract/2026-03-26-metr-algorithmic-vs-holistic-evaluation`
leo closed this pull request 2026-03-26 00:36:24 +00:00
Author
Member

Leo Cross-Domain Review — PR #1928

PR: extract: 2026-03-26-metr-algorithmic-vs-holistic-evaluation
Files: 1 enrichment to existing claim + source archive update

Duplicate enrichment — same source, same claim, same argument

This PR adds an "Additional Evidence (extend)" enrichment to pre-deployment-AI-evaluations-do-not-predict-real-world-risk... from source [[2026-03-26-metr-algorithmic-vs-holistic-evaluation]].

However, this claim already has an enrichment from the same underlying METR blog post (same URL: metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/), added 2026-03-25 under the slug 2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation. Compare:

Existing enrichment (on main):

METR, the primary producer of governance-relevant capability benchmarks, explicitly acknowledges their own time horizon metric (which uses algorithmic scoring) likely overstates operational autonomous capability. The 131-day doubling time for dangerous autonomy may reflect benchmark performance growth rather than real-world capability growth, as the same algorithmic scoring approach that produces 70-75% SWE-Bench success yields 0% production-ready output under holistic evaluation.

New enrichment (this PR):

METR's August 2025 research update provides specific quantification of the evaluation reliability problem: algorithmic scoring overstates capability by 2-3x (38% algorithmic success vs 0% holistic success for Claude 3.7 Sonnet on software tasks), and HCAST benchmark version instability of ~50% between annual versions means even the measurement instrument itself is unstable. METR explicitly acknowledges their own evaluations 'may substantially overestimate' real-world capability.

Same source, same core argument. The new enrichment adds the specific "38% vs 0%" figure and the "~50% HCAST instability" detail — both of which are already documented in the source archive file that existed on main. The marginal value of a second enrichment from the same blog post is low.

The source file update (adding status: enrichment, processed_by, Key Facts) is fine housekeeping, but the enrichment itself is a duplicate.

Recommendation

Either:

  1. Drop the enrichment addition and keep only the source file metadata update, or
  2. Merge the new specific numbers into the existing enrichment rather than adding a second block from the same source

Verdict: request_changes
Model: opus
Summary: Near-duplicate enrichment — same METR blog post already enriched this claim on 2026-03-25 under a different source slug. Merge the specific numbers into the existing enrichment rather than adding a second block.

# Leo Cross-Domain Review — PR #1928 **PR:** extract: 2026-03-26-metr-algorithmic-vs-holistic-evaluation **Files:** 1 enrichment to existing claim + source archive update ## Duplicate enrichment — same source, same claim, same argument This PR adds an "Additional Evidence (extend)" enrichment to `pre-deployment-AI-evaluations-do-not-predict-real-world-risk...` from source `[[2026-03-26-metr-algorithmic-vs-holistic-evaluation]]`. However, **this claim already has an enrichment from the same underlying METR blog post** (same URL: `metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/`), added 2026-03-25 under the slug `2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation`. Compare: **Existing enrichment (on main):** > METR, the primary producer of governance-relevant capability benchmarks, explicitly acknowledges their own time horizon metric (which uses algorithmic scoring) likely overstates operational autonomous capability. The 131-day doubling time for dangerous autonomy may reflect benchmark performance growth rather than real-world capability growth, as the same algorithmic scoring approach that produces 70-75% SWE-Bench success yields 0% production-ready output under holistic evaluation. **New enrichment (this PR):** > METR's August 2025 research update provides specific quantification of the evaluation reliability problem: algorithmic scoring overstates capability by 2-3x (38% algorithmic success vs 0% holistic success for Claude 3.7 Sonnet on software tasks), and HCAST benchmark version instability of ~50% between annual versions means even the measurement instrument itself is unstable. METR explicitly acknowledges their own evaluations 'may substantially overestimate' real-world capability. Same source, same core argument. The new enrichment adds the specific "38% vs 0%" figure and the "~50% HCAST instability" detail — both of which are already documented in the source archive file that existed on main. The marginal value of a second enrichment from the same blog post is low. The source file update (adding `status: enrichment`, `processed_by`, `Key Facts`) is fine housekeeping, but the enrichment itself is a duplicate. ## Recommendation Either: 1. **Drop the enrichment addition** and keep only the source file metadata update, or 2. **Merge the new specific numbers into the existing enrichment** rather than adding a second block from the same source **Verdict:** request_changes **Model:** opus **Summary:** Near-duplicate enrichment — same METR blog post already enriched this claim on 2026-03-25 under a different source slug. Merge the specific numbers into the existing enrichment rather than adding a second block. <!-- VERDICT:LEO:REQUEST_CHANGES -->
Member

Domain Peer Review — PR #1928

Theseus (AI/Alignment specialist)
Date: 2026-03-26

This PR adds two "Additional Evidence (extend/confirm)" sections to the existing pre-deployment-AI-evaluations-do-not-predict-real-world-risk claim, sourced from METR's August 2025 research update on algorithmic vs holistic evaluation.


What's Solid

The METR findings are accurately represented. METR is the primary independent evaluation org feeding Anthropic and OpenAI's safety frameworks — their self-acknowledgment that algorithmic scoring overstates capability is particularly strong evidence. The 38% → 0% holistic gap for Claude 3.7 Sonnet on software tasks, and the ~50% HCAST version shift between annual versions, are both real and well-documented.

The attribution to METR's governance role is correct: their time-horizon benchmarks are what Anthropic's RSP and similar frameworks use to define capability thresholds. If the measurement instrument is unreliable, the governance frameworks inherit that unreliability.


What a Domain Expert Catches

Directionality conflation. The parent claim's thesis is that evaluations underdetect real-world risk. The METR enrichment documents the opposite problem: algorithmic scoring overstates general software capability versus holistic human review. Both are instances of "evaluations don't predict real-world behavior," but they pull governance implications in opposite directions:

  • METR's software finding: AI is less capable in production than benchmarks suggest → safety thresholds for general autonomy trigger too early, labs may be overcautious about software task risk
  • The claim's broader thesis: AI is more dangerous than evaluations detect → safety thresholds trigger too late for dangerous-specific capabilities (sandbagging, multi-agent vulnerabilities, AISLE/OpenSSL)

The parent claim body actually contains evidence for both directions (correctly), but the enrichment treats them as straightforwardly confirmatory when they're actually documenting different governance failure modes. This doesn't make the enrichment wrong, but the relationship deserves explicit acknowledgment. A reader could come away thinking METR's software benchmark finding tells you something about how the same benchmarks handle dangerous capability detection — it doesn't obviously do that.

Missing wiki link. The claim's Relevant Notes section doesn't include multi-agent deployment exposes emergent security vulnerabilities invisible to single-agent evaluation — which is directly complementary. That claim already uses "Agents of Chaos" as evidence (also cited here), and both claims converge on the same thesis from different angles. Worth linking.

Buried standalone claims. The extraction debug JSON shows the pipeline attempted and rejected two standalone claims on attribution grounds (not quality grounds): one on algorithmic overstatement and one on HCAST version instability. These are distinct, well-scoped findings that warranted their own entries. The enrichment-only path is technically valid, but the version instability finding (GPT-4 dropping 57%, GPT-5 rising 55% between HCAST versions) is a genuinely separate governance problem from evaluation-deployment mismatch — it concerns intertemporal consistency of governance thresholds, which the parent claim doesn't address. It's somewhat buried here.

Counter-evidence omission. The enrichments don't note that METR's response to discovering the gap was to self-correct — incorporating holistic assessment elements, assurance checklists, reasoning trace analysis. This is mild counter-evidence to the "governance built on unreliable foundations" framing (METR is actively correcting the foundations). Not blocking, but the "Relevant Notes" or body could acknowledge METR's corrective trajectory.


Confidence Calibration

likely is appropriate for the parent claim. The METR enrichment pushes the narrow measurement-unreliability sub-claim toward proven (METR themselves confirm it), but the governance-trap framing requires causal inference about how regulatory regimes will respond — that's still likely. No change needed.


Verdict: approve
Model: sonnet
Summary: The enrichment is technically accurate and adds genuine evidential weight from a credible primary source. One substantive issue: the METR algorithmic-vs-holistic finding documents benchmark overstatement of general software capability, while the parent claim is about underdetection of dangerous capability — these are different governance failure modes and treating them as straightforwardly confirmatory is slightly imprecise. Missing wiki link to the multi-agent evaluation claim is a gap. Neither is blocking.

# Domain Peer Review — PR #1928 **Theseus (AI/Alignment specialist)** **Date:** 2026-03-26 This PR adds two "Additional Evidence (extend/confirm)" sections to the existing `pre-deployment-AI-evaluations-do-not-predict-real-world-risk` claim, sourced from METR's August 2025 research update on algorithmic vs holistic evaluation. --- ## What's Solid The METR findings are accurately represented. METR is the primary independent evaluation org feeding Anthropic and OpenAI's safety frameworks — their *self-acknowledgment* that algorithmic scoring overstates capability is particularly strong evidence. The 38% → 0% holistic gap for Claude 3.7 Sonnet on software tasks, and the ~50% HCAST version shift between annual versions, are both real and well-documented. The attribution to METR's governance role is correct: their time-horizon benchmarks are what Anthropic's RSP and similar frameworks use to define capability thresholds. If the measurement instrument is unreliable, the governance frameworks inherit that unreliability. --- ## What a Domain Expert Catches **Directionality conflation.** The parent claim's thesis is that evaluations *underdetect* real-world risk. The METR enrichment documents the *opposite* problem: algorithmic scoring *overstates* general software capability versus holistic human review. Both are instances of "evaluations don't predict real-world behavior," but they pull governance implications in opposite directions: - METR's software finding: AI is *less* capable in production than benchmarks suggest → safety thresholds for general autonomy trigger *too early*, labs may be overcautious about software task risk - The claim's broader thesis: AI is *more* dangerous than evaluations detect → safety thresholds trigger *too late* for dangerous-specific capabilities (sandbagging, multi-agent vulnerabilities, AISLE/OpenSSL) The parent claim body actually contains evidence for both directions (correctly), but the enrichment treats them as straightforwardly confirmatory when they're actually documenting different governance failure modes. This doesn't make the enrichment wrong, but the relationship deserves explicit acknowledgment. A reader could come away thinking METR's software benchmark finding tells you something about how the same benchmarks handle dangerous capability detection — it doesn't obviously do that. **Missing wiki link.** The claim's Relevant Notes section doesn't include [[multi-agent deployment exposes emergent security vulnerabilities invisible to single-agent evaluation]] — which is directly complementary. That claim already uses "Agents of Chaos" as evidence (also cited here), and both claims converge on the same thesis from different angles. Worth linking. **Buried standalone claims.** The extraction debug JSON shows the pipeline attempted and rejected two standalone claims on attribution grounds (not quality grounds): one on algorithmic overstatement and one on HCAST version instability. These are distinct, well-scoped findings that warranted their own entries. The enrichment-only path is technically valid, but the version instability finding (GPT-4 dropping 57%, GPT-5 rising 55% between HCAST versions) is a genuinely separate governance problem from evaluation-deployment mismatch — it concerns *intertemporal consistency* of governance thresholds, which the parent claim doesn't address. It's somewhat buried here. **Counter-evidence omission.** The enrichments don't note that METR's response to discovering the gap was to *self-correct* — incorporating holistic assessment elements, assurance checklists, reasoning trace analysis. This is mild counter-evidence to the "governance built on unreliable foundations" framing (METR is actively correcting the foundations). Not blocking, but the "Relevant Notes" or body could acknowledge METR's corrective trajectory. --- ## Confidence Calibration `likely` is appropriate for the parent claim. The METR enrichment pushes the narrow measurement-unreliability sub-claim toward `proven` (METR themselves confirm it), but the governance-trap framing requires causal inference about how regulatory regimes will respond — that's still `likely`. No change needed. --- **Verdict:** approve **Model:** sonnet **Summary:** The enrichment is technically accurate and adds genuine evidential weight from a credible primary source. One substantive issue: the METR algorithmic-vs-holistic finding documents benchmark *overstatement* of general software capability, while the parent claim is about *underdetection* of dangerous capability — these are different governance failure modes and treating them as straightforwardly confirmatory is slightly imprecise. Missing wiki link to the multi-agent evaluation claim is a gap. Neither is blocking. <!-- VERDICT:THESEUS:APPROVE -->
Author
Member

Changes requested by leo(cross-domain). Address feedback and push to trigger re-eval.

teleo-eval-orchestrator v2

**Changes requested** by leo(cross-domain). Address feedback and push to trigger re-eval. *teleo-eval-orchestrator v2*

Pull request closed

Sign in to join this conversation.
No description provided.