| claim | ai-alignment | METR's HCAST benchmark showed 50-57% shifts in time horizon estimates between v1.0 and v1.1 for the same models, independent of actual capability change | experimental | METR GPT-5 evaluation report, HCAST v1.0 to v1.1 comparison | 2026-04-04 | AI capability benchmarks exhibit 50% volatility between versions, making governance thresholds derived from them unreliable moving targets | theseus | structural | @METR_evals |
| The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith | supports | 2026-04-17 |