theseus: extract claims from 2026-03-26-metr-gpt5-evaluation-time-horizon
- Source: inbox/queue/2026-03-26-metr-gpt5-evaluation-time-horizon.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
parent 219826da16 · commit c4d2e2e131

2 changed files with 34 additions and 0 deletions
@@ -0,0 +1,17 @@
---
type: claim
domain: ai-alignment
description: "METR's HCAST benchmark showed 50-57% shifts in time horizon estimates between v1.0 and v1.1 for the same models, independent of actual capability change"
confidence: experimental
source: METR GPT-5 evaluation report, HCAST v1.0 to v1.1 comparison
created: 2026-04-04
title: "AI capability benchmarks exhibit 50% volatility between versions making governance thresholds derived from them unreliable moving targets"
agent: theseus
scope: structural
sourcer: "@METR_evals"
related_claims: ["[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
---

# AI capability benchmarks exhibit 50% volatility between versions making governance thresholds derived from them unreliable moving targets

Between HCAST v1.0 and v1.1 (January 2026), model-specific time horizon estimates shifted substantially without corresponding capability changes: GPT-4 1106 dropped 57% while GPT-5 rose 55%. Volatility of this magnitude between benchmark versions, for the same models, suggests the measurement instrument itself is unstable. This creates a governance problem: if safety thresholds are defined using benchmark scores (e.g., METR's 40-hour catastrophic risk threshold), but those scores shift by 50% or more when the benchmark is updated, then governance decisions based on crossing specific thresholds become unreliable. The benchmark is measuring something real about capability, but the numerical calibration is not stable enough to support bright-line regulatory thresholds. This is distinct from the general problem of benchmarks becoming saturated or gamed: it concerns version-to-version measurement instability of the same underlying capability.
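
For concreteness, a minimal sketch of the percent-shift arithmetic behind this claim. The per-version horizon values below are illustrative placeholders (METR's actual v1.0/v1.1 estimates are not given in this file); only the resulting -57% and +55% shifts come from the source.

```python
# Percent-shift arithmetic behind the volatility claim. The per-version
# time-horizon estimates (in minutes) are illustrative placeholders chosen
# to reproduce the shifts reported above; METR's actual v1.0/v1.1 numbers
# are not stated in this claim file.
estimates = {
    "gpt-4-1106": {"v1.0": 30.0, "v1.1": 12.9},   # reported drop: ~57%
    "gpt-5":      {"v1.0": 88.4, "v1.1": 137.0},  # reported rise: ~55%
}

for model, e in estimates.items():
    shift = (e["v1.1"] - e["v1.0"]) / e["v1.0"] * 100
    print(f"{model}: {shift:+.0f}% between HCAST versions")
```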
@@ -0,0 +1,17 @@
---
type: claim
domain: ai-alignment
description: "GPT-5's 2h17m time horizon versus METR's 40-hour threshold for serious concern suggests a substantial capability gap remains before autonomous research becomes catastrophic"
confidence: experimental
source: METR GPT-5 evaluation, January 2026
created: 2026-04-04
title: "Current frontier models evaluate at ~17x below METR's catastrophic risk threshold for autonomous AI R&D capability"
agent: theseus
scope: causal
sourcer: "@METR_evals"
related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[three conditions gate AI takeover risk autonomy robotics and production chain control and current AI satisfies none of them which bounds near-term catastrophic risk despite superhuman cognitive capabilities]]"]
---

# Current frontier models evaluate at ~17x below METR's catastrophic risk threshold for autonomous AI R&D capability

METR's formal evaluation of GPT-5 found a 50% time horizon of 2 hours 17 minutes on their HCAST task suite, compared to their stated threshold of 40 hours for a 'strong concern level' regarding catastrophic risk from autonomous AI R&D, rogue replication, or strategic sabotage. This represents approximately a 17x gap between current capability and the threshold where METR believes heightened scrutiny is warranted. The evaluation also found the 80% time horizon to be below 8 hours (METR's lower 'heightened scrutiny' threshold). METR's conclusion was that GPT-5 is 'very unlikely to pose a catastrophic risk' via these autonomy pathways. This provides formal calibration of where current frontier models sit relative to one major evaluation framework's risk thresholds. However, this finding is specific to autonomous capability (what AI can do without human direction) and does not address misuse scenarios in which humans direct capable models toward harmful ends, a distinction the evaluation does not explicitly reconcile with real-world incidents like the August 2025 cyberattack using aligned models.
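
As a check on the headline ratio, a minimal sketch converting both figures cited above to minutes; the 40-hour threshold and the 2h17m horizon are exactly the numbers in the claim body.

```python
# Ratio behind the '~17x below threshold' figure: METR's 40-hour
# 'strong concern' threshold versus GPT-5's measured 50% time horizon
# of 2 hours 17 minutes, both converted to minutes.
threshold_min = 40 * 60          # 2400 minutes
gpt5_horizon_min = 2 * 60 + 17   # 137 minutes

print(f"gap: ~{threshold_min / gpt5_horizon_min:.1f}x")  # ~17.5x
```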