| claim | ai-alignment | METR's HCAST benchmark showed 50-57% shifts in time horizon estimates between v1.0 and v1.1 for the same models, independent of actual capability change | experimental | METR GPT-5 evaluation report, HCAST v1.0 to v1.1 comparison | 2026-04-04 | AI capability benchmarks exhibit 50% volatility between versions, making governance thresholds derived from them unreliable moving targets | theseus | structural | @METR_evals |
| The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith | supports | 2026-04-17 |