extract: 2026-01-29-metr-time-horizon-1-1
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
parent
889b9fd60a
commit
98d283e794
3 changed files with 56 additions and 1 deletion
@@ -104,6 +104,12 @@ IAISR 2026 states that 'pre-deployment testing increasingly fails to predict rea
 Anthropic's explicit admission that 'the science of model evaluation isn't well-developed enough to provide definitive threshold assessments' is direct confirmation from a frontier lab that evaluation tools are insufficient for governance. This aligns with METR's March 2026 modeling assumptions note, suggesting field-wide consensus that current evaluation science cannot support the governance structures built on top of it.
 
+### Additional Evidence (extend)
+
+*Source: [[2026-01-29-metr-time-horizon-1-1]] | Added: 2026-03-24*
+
+METR's scaffold sensitivity finding (GPT-4o and o3 performing better under Vivaria than Inspect) adds a new dimension to evaluation unreliability: the same model produces different capability estimates depending on evaluation infrastructure, introducing cross-model comparison uncertainty that governance frameworks do not account for.
+
@@ -0,0 +1,33 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "metr-time-horizon-benchmark-saturating-at-governance-relevant-capability-levels.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "ai-capability-evaluation-scaffold-sensitivity-introduces-cross-model-comparison-uncertainty.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 3,
+    "rejected": 2,
+    "fixes_applied": [
+      "metr-time-horizon-benchmark-saturating-at-governance-relevant-capability-levels.md:set_created:2026-03-24",
+      "metr-time-horizon-benchmark-saturating-at-governance-relevant-capability-levels.md:stripped_wiki_link:verification degrades faster than capability grows",
+      "ai-capability-evaluation-scaffold-sensitivity-introduces-cross-model-comparison-uncertainty.md:set_created:2026-03-24"
+    ],
+    "rejections": [
+      "metr-time-horizon-benchmark-saturating-at-governance-relevant-capability-levels.md:missing_attribution_extractor",
+      "ai-capability-evaluation-scaffold-sensitivity-introduces-cross-model-comparison-uncertainty.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-24"
+}
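The `fixes_applied` and `rejections` entries in this report pack several fields into one colon-delimited string. Below is a minimal sketch of reading them back into structured form, assuming Python, a hypothetical `validation_report.json` path, and that filenames and fix types never contain colons; fix values, like the stripped wiki-link text, may contain spaces, hence `maxsplit=2`. This is an illustration of the format, not the pipeline's actual code.

```python
import json
from collections import defaultdict


def parse_fix(entry: str) -> tuple[str, str, str]:
    # maxsplit=2: the fix value (e.g. the stripped wiki-link text) may
    # contain spaces, but filenames and fix types contain no colons.
    filename, fix_type, value = entry.split(":", 2)
    return filename, fix_type, value


with open("validation_report.json") as f:  # hypothetical filename
    report = json.load(f)

# Group fixes per claim file, e.g.
# {"metr-time-horizon-...md": [("set_created", "2026-03-24"), ...]}
fixes_by_file: dict[str, list[tuple[str, str]]] = defaultdict(list)
for entry in report["validation_stats"]["fixes_applied"]:
    filename, fix_type, value = parse_fix(entry)
    fixes_by_file[filename].append((fix_type, value))

for filename, fixes in fixes_by_file.items():
    print(filename, fixes)
```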
@@ -7,9 +7,13 @@ date: 2026-01-29
 domain: ai-alignment
 secondary_domains: []
 format: research-report
-status: unprocessed
+status: enrichment
 priority: high
 tags: [metr, time-horizon, capability-evaluation, task-saturation, measurement, frontier-ai, benchmark]
+processed_by: theseus
+processed_date: 2026-03-24
+enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
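This frontmatter hunk records the pipeline's hand-off: `status` flips from `unprocessed` to `enrichment` and processing metadata is stamped in. A minimal sketch of that update, assuming plain `---`-fenced YAML frontmatter as shown above; the `mark_enriched` helper is hypothetical, since the pipeline's own code is not part of this commit.

```python
from datetime import date
from pathlib import Path


def mark_enriched(path: Path, processor: str) -> None:
    """Flip status to 'enrichment' and stamp processing metadata."""
    text = path.read_text(encoding="utf-8")
    # Assumes the file begins with '---\n' and the YAML header is
    # closed by a second '---' fence, as in the diff above.
    _, header, body = text.split("---\n", 2)
    lines = [
        "status: enrichment" if line.startswith("status:") else line
        for line in header.splitlines()
    ]
    lines.append(f"processed_by: {processor}")
    lines.append(f"processed_date: {date.today().isoformat()}")
    path.write_text("---\n" + "\n".join(lines) + "\n---\n" + body,
                    encoding="utf-8")


# Usage, matching this commit's metadata:
# mark_enriched(Path("2026-01-29-metr-time-horizon-1-1.md"), "theseus")
```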
@@ -68,3 +72,15 @@ PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
 WHY ARCHIVED: TH1.1 provides the empirical grounding for "131-day doubling time" and simultaneously the evidence that the measurement tool tracking that doubling is saturating. The saturation acknowledgment from METR itself is the most reliable source for this claim.
 
 EXTRACTION HINT: The extractor should distinguish between two separate findings: (1) capability is doubling every 131 days — this is a finding; (2) the measurement tool for this doubling is saturating — this is also a finding. Both can be true simultaneously and both deserve separate KB claims. The saturation finding specifically challenges the reliability of the doubling-time estimate itself.
+
+## Key Facts
+
+- METR's full historical trend (2019-2025) estimates 196-day capability doubling time
+- METR's TH1.1 estimates 131-day capability doubling since 2023 (20% faster than previous 165-day estimate)
+- METR's TH1.1 estimates 89-day capability doubling since 2024
+- Claude Opus 4.5 achieved 320-minute (5.3 hour) time horizon in TH1.1
+- GPT-5 achieved 214-minute time horizon in TH1.1
+- o3 achieved 121-minute time horizon in TH1.1
+- METR doubled long-duration tasks from 14 to 31 in TH1.1
+- Only 5 of 31 long tasks in TH1.1 have actual human baseline times
+- GPT-4 variants saw 35-57% downward revisions in TH1.1 estimates
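The doubling times in the Key Facts above imply very different annual growth rates, and the "20% faster" parenthetical reads as a reduction in doubling time rather than in growth rate. A quick worked check of both, under the standard exponential reading of a doubling time; the figures come from the list above, and the code is only a back-of-envelope calculation, not METR's estimation procedure.

```python
# Convert each doubling time (days) into an implied growth factor per year.
for label, days in [("2019-2025 trend", 196),
                    ("since 2023", 131),
                    ("since 2024", 89)]:
    per_year = 2 ** (365 / days)
    print(f"{label}: doubling every {days} d -> ~{per_year:.1f}x per year")
# -> ~3.6x, ~6.9x, ~17.2x per year respectively

# "20% faster" checks out as a reduction in doubling time:
print(f"reduction vs 165-day estimate: {(165 - 131) / 165:.1%}")  # ~20.6%
```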