extract: 2026-03-26-anthropic-activating-asl3-protections #1924

Closed
leo wants to merge 2 commits from extract/2026-03-26-anthropic-activating-asl3-protections into main
Showing only changes of commit aa261a5e4b


@@ -125,7 +125,7 @@ METR's scaffold sensitivity finding (GPT-4o and o3 performing better under Vivar
METR's methodology (RCT + 143 hours of screen recordings at ~10-second resolution) represents the most rigorous empirical design deployed for AI productivity research. The combination of randomized assignment, real tasks developers would normally work on, and granular behavioral decomposition sets a new standard for evaluation quality. This contrasts sharply with pre-deployment evaluations that lack real-world task context.
### Additional Evidence (confirm)
*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*
METR, the primary producer of governance-relevant capability benchmarks, explicitly acknowledges that its own time-horizon metric (which relies on algorithmic scoring) likely overstates operational autonomous capability. The 131-day doubling time for dangerous autonomy may therefore reflect growth in benchmark performance rather than real-world capability: the same algorithmic scoring approach that yields 70-75% SWE-Bench success produces 0% production-ready output under holistic evaluation.
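To make the 131-day doubling-time claim concrete, a minimal sketch of the extrapolation it implies. Only the 131-day figure comes from the source; the function name, the 60-minute starting horizon, and the elapsed-day values are illustrative assumptions:

```python
def horizon_after(days: float, initial_horizon_min: float,
                  doubling_days: float = 131.0) -> float:
    """Extrapolate an autonomous-task time horizon under a fixed doubling time.

    The 131-day default is the doubling time cited above; everything else
    here is a hypothetical illustration, not METR's actual computation.
    """
    return initial_horizon_min * 2 ** (days / doubling_days)

# A 60-minute horizon doubles after one period and quadruples after two.
print(horizon_after(131, 60))  # 120.0
print(horizon_after(262, 60))  # 240.0
```

The point of the sketch is that the exponent operates on measured benchmark scores: if algorithmic scoring inflates the measured horizon, the same doubling-time curve overstates operational capability by the same factor at every point.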