extract: 2026-03-26-anthropic-activating-asl3-protections

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
Teleo Agents 2026-03-26 00:31:58 +00:00
parent 290a0160ae
commit fcd3c793e2
3 changed files with 56 additions and 1 deletion


@@ -129,6 +129,12 @@ METR's methodology (RCT + 143 hours of screen recordings at ~10-second resolutio
METR, the primary producer of governance-relevant capability benchmarks, explicitly acknowledges that its own time-horizon metric (which uses algorithmic scoring) likely overstates operational autonomous capability. The 131-day doubling time for dangerous autonomy may therefore reflect benchmark-performance growth rather than real-world capability growth: the same algorithmic scoring approach that produces 70-75% SWE-Bench success yields 0% production-ready output under holistic evaluation.
### Additional Evidence (extend)
*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
Anthropic explicitly acknowledged that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This evaluation unreliability near thresholds is precisely where governance decisions matter most, creating a structural problem: the governance framework depends on measurements that become less reliable at the decision boundary.


@@ -0,0 +1,37 @@
{
"rejected_claims": [
{
"filename": "precautionary-ai-governance-triggers-higher-protections-when-capability-evaluation-becomes-unreliable.md",
"issues": [
"missing_attribution_extractor"
]
},
{
"filename": "self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 2,
"kept": 0,
"fixed": 7,
"rejected": 2,
"fixes_applied": [
"precautionary-ai-governance-triggers-higher-protections-when-capability-evaluation-becomes-unreliable.md:set_created:2026-03-26",
"precautionary-ai-governance-triggers-higher-protections-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
"precautionary-ai-governance-triggers-higher-protections-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:safe-AI-development-requires-building-alignment-mechanisms-b",
"self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md:set_created:2026-03-26",
"self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
"self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
"self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-"
],
"rejections": [
"precautionary-ai-governance-triggers-higher-protections-when-capability-evaluation-becomes-unreliable.md:missing_attribution_extractor",
"self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-26"
}
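The rejection and fix entries above imply a simple per-file validation pass: backfill a missing `created` date, strip wiki-links whose targets are recorded in the log, and reject any claim file that does not name its extractor. A minimal sketch of that logic — the function name, the `attribution.extractor` field, and the 60-character truncation of stripped link targets are all guesses, not the actual Teleo/Theseus implementation:

```python
import re

WIKI_LINK = re.compile(r"\[\[([^\]]+)\]\]")

def validate_claim(filename, frontmatter, body, today="2026-03-26"):
    """Hypothetical validation pass matching the stats schema above."""
    fixes, issues = [], []
    # Fix: backfill a missing created date (-> "set_created" entries).
    if not frontmatter.get("created"):
        frontmatter["created"] = today
        fixes.append(f"{filename}:set_created:{today}")
    # Fix: strip wiki-links, logging each target (-> "stripped_wiki_link").
    for target in WIKI_LINK.findall(body):
        fixes.append(f"{filename}:stripped_wiki_link:{target[:60]}")
    body = WIKI_LINK.sub(lambda m: m.group(1), body)
    # Reject: every claim must name the extractor that produced it.
    if "extractor" not in frontmatter.get("attribution", {}):
        issues.append("missing_attribution_extractor")
    return body, fixes, issues
```

Under this reading, `fixed` counts individual fix entries (7 across both files) while `total`/`kept`/`rejected` count files, which matches the stats block.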


@@ -7,9 +7,13 @@ date: 2025-05-01
domain: ai-alignment
secondary_domains: []
format: blog
status: unprocessed
status: enrichment
priority: high
tags: [ASL-3, precautionary-governance, CBRN, capability-thresholds, RSP, measurement-uncertainty, safety-cases]
processed_by: theseus
processed_date: 2026-03-26
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -49,3 +53,11 @@ ASL-3 protections were narrowly scoped: preventing assistance with extended, end
PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
WHY ARCHIVED: First documented precautionary capability threshold activation — governance acting before measurement confirmation rather than after
EXTRACTION HINT: Focus on the *logic* of precautionary activation (uncertainty triggers more caution) as the claim, not just the CBRN specifics — the governance principle generalizes
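The precautionary logic the extraction hint points at reduces to a one-line decision rule; a hedged sketch (the level names mirror Anthropic's ASL scheme, but the function itself is illustrative, not Anthropic's procedure):

```python
def deployment_standard(confirmed_below_threshold: bool) -> str:
    # Precautionary activation: when evaluation cannot positively confirm
    # the model is below the capability threshold, deploy under the higher
    # standard anyway rather than waiting for measurement certainty.
    return "ASL-2" if confirmed_below_threshold else "ASL-3"
```

The asymmetry is the point: uncertainty maps to more protection, not less.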
## Key Facts
- Claude Opus 4 was the first Anthropic model that could not be positively confirmed as below ASL-3 thresholds
- ASL-3 protections were narrowly scoped to prevent assistance with extended end-to-end CBRN workflows
- Claude Sonnet 3.7 showed measurable uplift in CBRN weapon acquisition tasks compared to internet resources, though below formal thresholds
- Virology Capabilities Test performance had been steadily increasing over time across Claude model generations
- Anthropic's RSP explicitly permits deployment under higher standards than confirmed necessary