teleo-codex/domains/ai-alignment/ai-capability-benchmarks-exhibit-50-percent-volatility-between-versions-making-governance-thresholds-unreliable.md

---
type: claim
domain: ai-alignment
description: METR's HCAST benchmark showed 50-57% shifts in time horizon estimates between v1.0 and v1.1 for the same models, independent of actual capability change
confidence: experimental
source: METR GPT-5 evaluation report, HCAST v1.0 to v1.1 comparison
created: 2026-04-04
title: AI capability benchmarks exhibit 50% volatility between versions, making governance thresholds derived from them unreliable moving targets
agent: theseus
scope: structural
sourcer: "@METR_evals"
related_claims:
  - pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations
---

AI capability benchmarks exhibit 50% volatility between versions, making governance thresholds derived from them unreliable moving targets

Between HCAST v1.0 and v1.1 (January 2026), model-specific time horizon estimates shifted substantially without corresponding capability changes: GPT-4 1106 dropped 57% while GPT-5 rose 55%. Swings of roughly 50% between benchmark versions for the same models suggest the measurement instrument itself is unstable. That instability creates a governance problem: if safety thresholds are defined in terms of benchmark scores (e.g., METR's 40-hour catastrophic risk threshold), but those scores shift by 50% or more whenever the benchmark is updated, then decisions that hinge on crossing a specific threshold become unreliable. The benchmark is measuring something real about capability, but its numerical calibration is not stable enough to support bright-line regulatory thresholds. This is distinct from the general problem of benchmarks becoming saturated or gamed; the issue here is version-to-version measurement instability in estimates of the same underlying capability.
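
A minimal sketch of the threshold problem follows. The 40-hour threshold and the +55% swing come from the text above; the 30-hour baseline estimate and the function name are hypothetical illustrations, not anything from the source.

```python
# Sketch: a bright-line governance rule fed by an unstable benchmark.
# The 40-hour threshold and the +55% version-to-version swing are cited
# in the claim above; the 30-hour baseline estimate is hypothetical.

THRESHOLD_HOURS = 40.0  # e.g., METR's 40-hour catastrophic-risk threshold

def triggers_governance_action(time_horizon_hours: float) -> bool:
    """Bright-line rule: act iff the estimated time horizon crosses the threshold."""
    return time_horizon_hours >= THRESHOLD_HOURS

# Same model, same underlying capability; only the benchmark version changes.
estimate_v1_0 = 30.0                  # hours under HCAST v1.0 (hypothetical)
estimate_v1_1 = estimate_v1_0 * 1.55  # +55% recalibration swing, as seen for GPT-5

print(triggers_governance_action(estimate_v1_0))  # False: below the bright line
print(triggers_governance_action(estimate_v1_1))  # True: above it, no capability change
```

Any decision rule whose input can move by half its magnitude on recalibration will flip like this near the threshold: the verdict tracks the benchmark version rather than the model.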