diff --git a/domains/ai-alignment/Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md b/domains/ai-alignment/Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md
index 669cf4e1..b55594ab 100644
--- a/domains/ai-alignment/Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md
+++ b/domains/ai-alignment/Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md
@@ -39,6 +39,12 @@ METR's pre-deployment sabotage reviews of Anthropic models (March 2026: Claude O
 
 The response gap explains a deeper problem than commitment erosion: even if commitments held, there's no institutional infrastructure to coordinate response when prevention fails. Anthropic's RSP rollback is about prevention commitments weakening; Mengesha identifies that we lack response mechanisms entirely. The two failures compound — weak prevention plus absent response creates a system that cannot learn from failures.
 
+### Additional Evidence (confirm)
+*Source: [[2026-03-20-metr-modeling-assumptions-time-horizon-reliability]] | Added: 2026-03-23*
+
+METR's finding that their time horizon metric has 1.5-2x uncertainty for frontier models provides independent technical confirmation of Anthropic's RSP v3.0 admission that 'the science of model evaluation isn't well-developed enough.' Both organizations independently arrived at the same conclusion within two months: measurement tools are not ready for governance enforcement.
+
+
 Relevant Notes:
diff --git a/inbox/queue/.extraction-debug/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.json b/inbox/queue/.extraction-debug/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.json
new file mode 100644
index 00000000..9ac5d3de
--- /dev/null
+++ b/inbox/queue/.extraction-debug/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.json
@@ -0,0 +1,24 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "capability-measurement-saturation-creates-governance-enforcement-gap-at-frontier.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 1,
+    "kept": 0,
+    "fixed": 1,
+    "rejected": 1,
+    "fixes_applied": [
+      "capability-measurement-saturation-creates-governance-enforcement-gap-at-frontier.md:set_created:2026-03-23"
+    ],
+    "rejections": [
+      "capability-measurement-saturation-creates-governance-enforcement-gap-at-frontier.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-23"
+}
\ No newline at end of file
diff --git a/inbox/queue/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.md b/inbox/queue/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.md
index 0bdfbf1a..8151e67c 100644
--- a/inbox/queue/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.md
+++ b/inbox/queue/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.md
@@ -7,9 +7,13 @@ date: 2026-03-20
 domain: ai-alignment
 secondary_domains: []
 format: technical-note
-status: unprocessed
+status: enrichment
 priority: high
 tags: [metr, time-horizon, measurement-reliability, evaluation-saturation, Opus-4.6, modeling-uncertainty]
+processed_by: theseus
+processed_date: 2026-03-23
+enrichments_applied: ["Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -53,3 +57,14 @@ METR published a technical note (March 20, 2026 — 3 days before this session)
 PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
 WHY ARCHIVED: Direct evidence that the primary capability measurement tool has 1.5-2x uncertainty at the frontier — governance cannot set enforceable thresholds on unmeasurable capabilities
 EXTRACTION HINT: The "measurement saturation" concept may deserve its own claim distinct from the scalable oversight degradation claim — it's about the measurement tools themselves failing, not the oversight mechanisms
+
+
+## Key Facts
+- METR published technical note on March 20, 2026 analyzing modeling assumption impacts on time horizon estimates
+- Opus 4.6 shows 50% time horizon variation of approximately 1.5x across modeling choices
+- Opus 4.6 shows 80% time horizon variation of approximately 2x across modeling choices
+- Task length noise contributes 25-40% potential reduction in time horizon estimates
+- Success rate curve modeling contributes up to 35% reduction in estimates
+- Opus 4.6 shows 40% reduction when excluding public tasks, driven by RE-Bench performance
+- Confidence interval for Opus 4.6's 50% time horizon spans 6-98 hours (16x range)
+- Older models show smaller modeling assumption impact due to more data and less extrapolation