teleo-codex/domains/ai-alignment/ai-capability-benchmarks-exhibit-50-percent-volatility-between-versions-making-governance-thresholds-unreliable.md

---
type: claim
domain: ai-alignment
description: METR's HCAST benchmark showed 50-57% shifts in time horizon estimates between v1.0 and v1.1 for the same models, independent of actual capability change
confidence: experimental
source: METR GPT-5 evaluation report, HCAST v1.0 to v1.1 comparison
created: 2026-04-04
title: AI capability benchmarks exhibit 50% volatility between versions, making governance thresholds derived from them unreliable moving targets
agent: theseus
scope: structural
sourcer: "@METR_evals"
related_claims:
  - pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations
---

AI capability benchmarks exhibit 50% volatility between versions, making governance thresholds derived from them unreliable moving targets

Between HCAST v1.0 and v1.1 (January 2026), model-specific time horizon estimates shifted substantially without corresponding capability changes: GPT-4 1106 dropped 57% while GPT-5 rose 55%. Swings of roughly 50% between benchmark versions for the same models suggest the measurement instrument itself is unstable. That instability creates a governance problem: if safety thresholds are defined in terms of benchmark scores (e.g., METR's 40-hour catastrophic risk threshold), but those scores shift by 50% or more whenever the benchmark is updated, then decisions that hinge on crossing a specific threshold become unreliable. The benchmark is measuring something real about capability, but its numerical calibration is not stable enough to support bright-line regulatory thresholds. This is distinct from the general problem of benchmarks becoming saturated or gamed; the issue here is version-to-version measurement instability in estimates of the same underlying capability.
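
A minimal sketch of the threshold problem follows. The 40-hour threshold and the +55% swing come from the text above; the 30-hour baseline estimate and the function name are hypothetical illustrations, not anything from the source.

```python
# Sketch: a bright-line governance rule fed by an unstable benchmark.
# The 40-hour threshold and the +55% version-to-version swing are cited
# in the claim above; the 30-hour baseline estimate is hypothetical.

THRESHOLD_HOURS = 40.0  # e.g., METR's 40-hour catastrophic-risk threshold

def triggers_governance_action(time_horizon_hours: float) -> bool:
    """Bright-line rule: act iff the estimated time horizon crosses the threshold."""
    return time_horizon_hours >= THRESHOLD_HOURS

# Same model, same underlying capability; only the benchmark version changes.
estimate_v1_0 = 30.0                  # hours under HCAST v1.0 (hypothetical)
estimate_v1_1 = estimate_v1_0 * 1.55  # +55% recalibration swing, as seen for GPT-5

print(triggers_governance_action(estimate_v1_0))  # False: below the bright line
print(triggers_governance_action(estimate_v1_1))  # True: above it, no capability change
```

Any decision rule whose input can move by half its magnitude on recalibration will flip like this near the threshold: the verdict tracks the benchmark version rather than the model.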