extract: 2026-03-26-metr-gpt5-evaluation-time-horizon
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
parent 99c7dc4ab7
commit a931485003
3 changed files with 57 additions and 1 deletion
@@ -134,6 +134,12 @@ METR, the primary producer of governance-relevant capability benchmarks, explici
METR's January 2026 evaluation of GPT-5 placed its autonomous replication and adaptation capability at 2h17m (50% time horizon), far below catastrophic risk thresholds. In the same month, AISLE (an AI system) discovered 12 OpenSSL CVEs, including a 30-year-old bug, through fully autonomous operation. The juxtaposition is direct evidence that formal pre-deployment evaluations are not capturing dangerous autonomous capability that is already deployed at commercial scale.
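For calibration against the thresholds listed under Key Facts below, the gap between GPT-5's measured horizon and METR's 40-hour catastrophic risk threshold works out to roughly 17x. A minimal sketch of that arithmetic, using the figures quoted in this note rather than METR's own tooling:

```python
# Margin between GPT-5's measured 50% time horizon and METR's
# catastrophic risk threshold, both taken from this note.
measured_horizon_h = 2 + 17 / 60      # 2h17m, GPT-5's 50% time horizon
catastrophic_threshold_h = 40.0       # METR threshold at the 50% horizon

margin = catastrophic_threshold_h / measured_horizon_h
print(f"safety margin: {margin:.1f}x")  # ~17.5x, the "17x" cited below
```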
### Additional Evidence (extend)
*Source: [[2026-03-26-metr-gpt5-evaluation-time-horizon]] | Added: 2026-03-26*
METR's HCAST benchmark showed 50-57% volatility in time horizon estimates between v1.0 and v1.1 for the same models (GPT-4 1106 dropped 57%, GPT-5 rose 55%), demonstrating that even the most rigorous pre-deployment evaluations have structural instability independent of actual capability changes. This extends the evaluation reliability critique from 'doesn't predict deployment risk' to 'cannot even maintain consistent capability measurements across benchmark versions.'
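The volatility figure is a relative shift between benchmark versions. A minimal sketch of that computation, with a hypothetical baseline value since the note reports only the percentage changes:

```python
def version_volatility(horizon_v10_h: float, horizon_v11_h: float) -> float:
    """Relative shift in a time horizon estimate between benchmark
    versions, as a signed fraction of the v1.0 estimate."""
    return (horizon_v11_h - horizon_v10_h) / horizon_v10_h

# Hypothetical baseline (arbitrary units): a 57% drop, as reported for
# GPT-4 1106, yields -57% regardless of the baseline's actual value.
baseline = 1.0
print(f"{version_volatility(baseline, baseline * (1 - 0.57)):+.0%}")  # -57%
```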
@@ -0,0 +1,35 @@
{
  "rejected_claims": [
    {
      "filename": "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 5,
    "rejected": 2,
    "fixes_applied": [
      "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md:set_created:2026-03-26",
      "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md:stripped_wiki_link:pre-deployment-AI-evaluations-do-not-predict-real-world-risk",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:set_created:2026-03-26",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:stripped_wiki_link:three-conditions-gate-AI-takeover-risk-autonomy-robotics-and",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:stripped_wiki_link:capability-control-methods-are-temporary-at-best-because-a-s"
    ],
    "rejections": [
      "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md:missing_attribution_extractor",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-26"
}
@@ -7,9 +7,13 @@ date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: report
-status: unprocessed
+status: enrichment
priority: medium
tags: [METR, GPT-5, time-horizon, capability-thresholds, safety-evaluation, holistic-evaluation, governance-thresholds, catastrophic-risk]
processed_by: theseus
processed_date: 2026-03-26
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content
@@ -59,3 +63,14 @@ This suggests ~50% volatility in time horizon estimates between benchmark versio
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: Provides formal numerical calibration of where current frontier models sit relative to governance thresholds, essential context for evaluating B1's "greatest outstanding problem" claim. The finding (2h17m vs the 40-hour threshold) partially challenges alarmist interpretations, while the 50%+ benchmark instability keeps the governance concern live.
EXTRACTION HINT: Separate the claims: (1) "Current frontier models evaluate at ~17x below METR's catastrophic risk threshold for autonomous AI R&D" (calibrating B1); (2) "METR's time horizon benchmark shifted 50-57% between v1.0 and v1.1, making governance thresholds derived from it a moving target" (the reliability problem).
## Key Facts
- GPT-5 achieved a 50% time horizon of 2 hours 17 minutes on METR's HCAST evaluation
- GPT-5's 80% time horizon was below 8 hours
- METR's catastrophic risk threshold is 40 hours at the 50% time horizon for software engineering/ML tasks
- METR's heightened scrutiny threshold is 8 hours at the 80% time horizon
- HCAST v1.1 contains 228 tasks as of January 2026
- GPT-4 1106's time horizon estimate dropped 57% between HCAST v1.0 and v1.1
- GPT-5's time horizon estimate rose 55% between HCAST v1.0 and v1.1
- METR evaluations are used by OpenAI, Anthropic, and other frontier labs for safety milestone assessments
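A minimal sketch of how the two thresholds above compose into a single gating check; the class, function, and the 7.9-hour stand-in for GPT-5's sub-8-hour 80% horizon are hypothetical, not METR's tooling:

```python
from dataclasses import dataclass

@dataclass
class HorizonMeasurement:
    p50_hours: float  # 50% time horizon
    p80_hours: float  # 80% time horizon

# Thresholds from the key facts above.
CATASTROPHIC_P50_HOURS = 40.0        # catastrophic risk threshold
HEIGHTENED_SCRUTINY_P80_HOURS = 8.0  # heightened scrutiny threshold

def classify(m: HorizonMeasurement) -> str:
    """Map a measurement onto METR's two published thresholds."""
    if m.p50_hours >= CATASTROPHIC_P50_HOURS:
        return "catastrophic-risk threshold crossed"
    if m.p80_hours >= HEIGHTENED_SCRUTINY_P80_HOURS:
        return "heightened scrutiny"
    return "below both thresholds"

# GPT-5, January 2026: 2h17m at 50%; 80% horizon "below 8 hours"
# (exact value not given in the note, so 7.9 is an assumed stand-in).
print(classify(HorizonMeasurement(p50_hours=2 + 17 / 60, p80_hours=7.9)))
# -> below both thresholds
```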