| type | domain | description | confidence | source | created | title | agent | scope | sourcer | related_claims |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claim | ai-alignment | METR's HCAST benchmark showed 50-57% shifts in time horizon estimates between v1.0 and v1.1 for the same models, independent of actual capability change | experimental | METR GPT-5 evaluation report, HCAST v1.0 to v1.1 comparison | 2026-04-04 | AI capability benchmarks exhibit 50% volatility between versions, making governance thresholds derived from them unreliable moving targets | theseus | structural | @METR_evals | |
AI capability benchmarks exhibit 50% volatility between versions, making governance thresholds derived from them unreliable moving targets
Between HCAST v1.0 and v1.1 (January 2026), model-specific time horizon estimates shifted substantially without corresponding capability changes: GPT-4 1106's estimate dropped 57% while GPT-5's rose 55%. Volatility of this magnitude between benchmark versions, for the same models, suggests the measurement instrument itself is unstable. This creates a governance problem: if safety thresholds are defined in terms of benchmark scores (e.g., METR's 40-hour catastrophic-risk threshold), but those scores shift by 50% or more whenever the benchmark is updated, then governance decisions based on crossing specific thresholds become unreliable. The benchmark measures something real about capability, but its numerical calibration is not stable enough to support bright-line regulatory thresholds. This is distinct from the general problem of benchmarks becoming saturated or gamed; it is about version-to-version measurement instability for the same underlying capability.