extract: 2026-03-26-anthropic-activating-asl3-protections
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
parent
517128bda1
commit
4ba0a55160
3 changed files with 56 additions and 1 deletions
|
|
@ -139,6 +139,12 @@ METR's January 2026 evaluation of GPT-5 placed its autonomous replication and ad
|
||||||
|
|
||||||
METR's August 2025 research update provides specific quantification of the evaluation reliability problem: algorithmic scoring overstates capability by 2-3x (38% algorithmic success vs 0% holistic success for Claude 3.7 Sonnet on software tasks), and HCAST benchmark version instability of ~50% between annual versions means even the measurement instrument itself is unstable. METR explicitly acknowledges their own evaluations 'may substantially overestimate' real-world capability.
|
METR's August 2025 research update provides specific quantification of the evaluation reliability problem: algorithmic scoring overstates capability by 2-3x (38% algorithmic success vs 0% holistic success for Claude 3.7 Sonnet on software tasks), and HCAST benchmark version instability of ~50% between annual versions means even the measurement instrument itself is unstable. METR explicitly acknowledges their own evaluations 'may substantially overestimate' real-world capability.
|
||||||
|
|
||||||
|
### Additional Evidence (extend)
|
||||||
|
*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
|
||||||
|
|
||||||
|
Anthropic explicitly acknowledged that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This is a frontier lab publicly stating that evaluation reliability degrades precisely when it matters most—near capability thresholds. The ASL-3 activation was triggered by this evaluation uncertainty rather than confirmed capability, suggesting governance frameworks are adapting to evaluation unreliability rather than solving it.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,37 @@
|
||||||
|
{
|
||||||
|
"rejected_claims": [
|
||||||
|
{
|
||||||
|
"filename": "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md",
|
||||||
|
"issues": [
|
||||||
|
"missing_attribution_extractor"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"filename": "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md",
|
||||||
|
"issues": [
|
||||||
|
"missing_attribution_extractor"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"validation_stats": {
|
||||||
|
"total": 2,
|
||||||
|
"kept": 0,
|
||||||
|
"fixed": 7,
|
||||||
|
"rejected": 2,
|
||||||
|
"fixes_applied": [
|
||||||
|
"precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:set_created:2026-03-26",
|
||||||
|
"precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
|
||||||
|
"precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:safe-AI-development-requires-building-alignment-mechanisms-b",
|
||||||
|
"ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:set_created:2026-03-26",
|
||||||
|
"ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
|
||||||
|
"ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
|
||||||
|
"ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-"
|
||||||
|
],
|
||||||
|
"rejections": [
|
||||||
|
"precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:missing_attribution_extractor",
|
||||||
|
"ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:missing_attribution_extractor"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"model": "anthropic/claude-sonnet-4.5",
|
||||||
|
"date": "2026-03-26"
|
||||||
|
}
|
||||||
|
|
@ -7,9 +7,13 @@ date: 2025-05-01
|
||||||
domain: ai-alignment
|
domain: ai-alignment
|
||||||
secondary_domains: []
|
secondary_domains: []
|
||||||
format: blog
|
format: blog
|
||||||
status: unprocessed
|
status: enrichment
|
||||||
priority: high
|
priority: high
|
||||||
tags: [ASL-3, precautionary-governance, CBRN, capability-thresholds, RSP, measurement-uncertainty, safety-cases]
|
tags: [ASL-3, precautionary-governance, CBRN, capability-thresholds, RSP, measurement-uncertainty, safety-cases]
|
||||||
|
processed_by: theseus
|
||||||
|
processed_date: 2026-03-26
|
||||||
|
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
|
||||||
|
extraction_model: "anthropic/claude-sonnet-4.5"
|
||||||
---
|
---
|
||||||
|
|
||||||
## Content
|
## Content
|
||||||
|
|
@ -49,3 +53,11 @@ ASL-3 protections were narrowly scoped: preventing assistance with extended, end
|
||||||
PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
|
PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
|
||||||
WHY ARCHIVED: First documented precautionary capability threshold activation — governance acting before measurement confirmation rather than after
|
WHY ARCHIVED: First documented precautionary capability threshold activation — governance acting before measurement confirmation rather than after
|
||||||
EXTRACTION HINT: Focus on the *logic* of precautionary activation (uncertainty triggers more caution) as the claim, not just the CBRN specifics — the governance principle generalizes
|
EXTRACTION HINT: Focus on the *logic* of precautionary activation (uncertainty triggers more caution) as the claim, not just the CBRN specifics — the governance principle generalizes
|
||||||
|
|
||||||
|
|
||||||
|
## Key Facts
|
||||||
|
- Claude Opus 4 was the first Claude model that could not be positively confirmed as below ASL-3 thresholds
|
||||||
|
- ASL-3 protections were narrowly scoped to prevent assistance with extended end-to-end CBRN workflows
|
||||||
|
- Claude Sonnet 3.7 showed measurable participant uplift on CBRN weapon acquisition tasks compared to standard internet resources
|
||||||
|
- Virology Capabilities Test performance had been steadily increasing over time across Claude model generations
|
||||||
|
- Anthropic's RSP explicitly permits deployment under a higher standard than confirmed necessary
|
||||||
|
|
|
||||||
Loading…
Reference in a new issue