extract: 2026-03-20-metr-modeling-assumptions-time-horizon-reliability
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
parent
8dedfd687e
commit
df33272fbd
3 changed files with 46 additions and 1 deletion
@@ -39,6 +39,12 @@ METR's pre-deployment sabotage reviews of Anthropic models (March 2026: Claude O
The response gap exposes a deeper problem than commitment erosion: even if commitments held, there is no institutional infrastructure to coordinate a response when prevention fails. Anthropic's RSP rollback shows prevention commitments weakening; Mengesha identifies that we lack response mechanisms entirely. The two failures compound: weak prevention plus absent response creates a system that cannot learn from its failures.
### Additional Evidence (confirm)
*Source: [[2026-03-20-metr-modeling-assumptions-time-horizon-reliability]] | Added: 2026-03-23*
METR's finding that their time horizon metric has 1.5-2x uncertainty for frontier models provides independent technical confirmation of Anthropic's RSP v3.0 admission that 'the science of model evaluation isn't well-developed enough.' Both organizations independently arrived at the same conclusion within two months: measurement tools are not ready for governance enforcement.
Relevant Notes:
@@ -0,0 +1,24 @@
{
  "rejected_claims": [
    {
      "filename": "capability-measurement-saturation-creates-governance-enforcement-gap-at-frontier.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 1,
    "kept": 0,
    "fixed": 1,
    "rejected": 1,
    "fixes_applied": [
      "capability-measurement-saturation-creates-governance-enforcement-gap-at-frontier.md:set_created:2026-03-23"
    ],
    "rejections": [
      "capability-measurement-saturation-creates-governance-enforcement-gap-at-frontier.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-23"
}
@@ -7,9 +7,13 @@ date: 2026-03-20
domain: ai-alignment
secondary_domains: []
format: technical-note
-status: unprocessed
+status: enrichment
priority: high
tags: [metr, time-horizon, measurement-reliability, evaluation-saturation, Opus-4.6, modeling-uncertainty]
processed_by: theseus
processed_date: 2026-03-23
enrichments_applied: ["Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -53,3 +57,14 @@ METR published a technical note (March 20, 2026 — 3 days before this session)
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: Direct evidence that the primary capability measurement tool has 1.5-2x uncertainty at the frontier — governance cannot set enforceable thresholds on unmeasurable capabilities
EXTRACTION HINT: The "measurement saturation" concept may deserve its own claim distinct from the scalable oversight degradation claim — it's about the measurement tools themselves failing, not the oversight mechanisms
## Key Facts
- METR published a technical note on March 20, 2026 analyzing the impact of modeling assumptions on time horizon estimates
- Opus 4.6 shows 50% time horizon variation of approximately 1.5x across modeling choices
- Opus 4.6 shows 80% time horizon variation of approximately 2x across modeling choices
- Task length noise contributes a potential 25-40% reduction in time horizon estimates
- Success rate curve modeling contributes up to a 35% reduction in estimates
- Opus 4.6 shows a 40% reduction when public tasks are excluded, driven by RE-Bench performance
- The confidence interval for Opus 4.6's 50% time horizon spans 6-98 hours (a 16x range)
- Older models show smaller modeling-assumption impact due to more data and less extrapolation
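To make the multiplicative figures above concrete, here is a minimal sketch (illustrative only, not METR's methodology; the 24-hour point estimate is a made-up number) of how a 1.5x or 2x variation factor translates into an interval around an estimate, and why the reported 6-98 hour interval amounts to roughly a 16x range:

```python
def uncertainty_interval(point_estimate_hours: float, factor: float) -> tuple[float, float]:
    """Multiplicative uncertainty band: [estimate / factor, estimate * factor]."""
    return point_estimate_hours / factor, point_estimate_hours * factor

# Hypothetical point estimate of 24 hours with ~2x variation across modeling choices:
lo, hi = uncertainty_interval(24.0, 2.0)
print(lo, hi)  # 12.0 48.0 — a 4x spread end to end

# The reported 6-98 hour confidence interval spans roughly a 16x range:
print(round(98 / 6, 1))  # 16.3
```

The end-to-end spread is the square of the variation factor, which is why a "2x" modeling uncertainty produces intervals whose upper bound is 4x the lower bound.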