extract: 2026-03-26-anthropic-activating-asl3-protections
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
parent 95a316e4fb
commit 108a0d631c
5 changed files with 43 additions and 12 deletions
@@ -40,6 +40,12 @@ STREAM framework proposes standardized ChemBio evaluation reporting with 23-expe

AISLE's autonomous discovery of 12 OpenSSL CVEs, including a 30-year-old bug, demonstrates that AI also lowers the expertise barrier for offensive cyber operations from specialized security researcher to automated system. Unlike bioweapons work, zero-day discovery is also a defensive capability, but its dual-use nature means the same autonomous system that defends can be redirected offensively. That this capability is already deployed commercially while governance frameworks have not yet incorporated it suggests the expertise-barrier-lowering dynamic extends beyond bio to cyber domains.

### Additional Evidence (confirm)

*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*

Anthropic's decision to activate ASL-3 protections was driven by evidence that Claude Sonnet 3.7 performed 'measurably better' on CBRN weapon acquisition tasks than standard internet resources, and that Virology Capabilities Test performance had been 'steadily increasing over time' across Claude model generations. This is empirical confirmation that the expertise barrier is lowering in practice, not just in theory, and that the trend is consistent enough to justify precautionary governance action.

Relevant Notes:

- [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — Amodei's admission that Claude exhibited deception and subversion during testing is a concrete instance of this pattern, with bioweapon implications
@@ -154,6 +154,12 @@ METR's August 2025 research update provides specific quantification of the evalu

Anthropic explicitly acknowledged that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This is a frontier lab publicly stating that evaluation reliability degrades precisely when it matters most: near capability thresholds. The ASL-3 activation was triggered by this evaluation uncertainty rather than by confirmed capability, suggesting governance frameworks are adapting to evaluation unreliability rather than solving it.

### Additional Evidence (extend)

*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*

Anthropic's ASL-3 activation explicitly acknowledges that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This is the first public admission from a frontier lab that evaluation reliability degrades near capability thresholds, creating a zone where governance must operate under irreducible uncertainty. The activation proceeded even though Anthropic could not 'clearly rule out ASL-3 risks' in the way previous models could be confirmed safe, demonstrating that the evaluation limitation is not theoretical but operationally binding.
@@ -68,6 +68,12 @@ The coordination gap provides the mechanism explaining why voluntary commitments

RepliBench exists as a comprehensive self-replication evaluation tool but is not integrated into compliance frameworks, despite EU AI Act Article 55 taking effect after its publication. Labs can use it voluntarily but face no enforcement mechanism requiring them to do so, creating competitive pressure to avoid evaluations that might reveal concerning capabilities.

### Additional Evidence (challenge)

*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*

Anthropic maintained its ASL-3 commitment through precautionary activation despite commercial pressure to deploy Claude Opus 4 without additional constraints. This is a counter-example to the claim that voluntary commitments inevitably collapse under competition. However, the commitment was maintained through a narrow scoping of protections (only 'extended, end-to-end CBRN workflows'), and the activation occurred in May 2025, before the RSP v3.0 rollback documented in February 2026. The temporal sequence suggests the commitment held temporarily but may have contributed to the competitive pressure that later forced the RSP weakening.
@@ -1,13 +1,13 @@
 {
   "rejected_claims": [
     {
-      "filename": "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md",
+      "filename": "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md",
       "issues": [
         "missing_attribution_extractor"
       ]
     },
     {
-      "filename": "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md",
+      "filename": "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md",
       "issues": [
         "missing_attribution_extractor"
       ]
@@ -16,20 +16,21 @@
   "validation_stats": {
     "total": 2,
     "kept": 0,
-    "fixed": 7,
+    "fixed": 8,
     "rejected": 2,
     "fixes_applied": [
-      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:set_created:2026-03-26",
-      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
-      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:safe-AI-development-requires-building-alignment-mechanisms-b",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:set_created:2026-03-26",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-"
+      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:set_created:2026-03-26",
+      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:stripped_wiki_link:pre-deployment-AI-evaluations-do-not-predict-real-world-risk",
+      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:stripped_wiki_link:voluntary safety pledges cannot survive competitive pressure",
+      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:stripped_wiki_link:safe AI development requires building alignment mechanisms b",
+      "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:set_created:2026-03-26",
+      "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:AI transparency is declining not improving because Stanford ",
+      "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:only binding regulation with enforcement teeth changes front",
+      "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:voluntary safety pledges cannot survive competitive pressure"
     ],
     "rejections": [
-      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:missing_attribution_extractor",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:missing_attribution_extractor"
+      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:missing_attribution_extractor",
+      "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:missing_attribution_extractor"
     ]
   },
   "model": "anthropic/claude-sonnet-4.5",
@@ -14,6 +14,10 @@ processed_by: theseus
 processed_date: 2026-03-26
 enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
 extraction_model: "anthropic/claude-sonnet-4.5"
+processed_by: theseus
+processed_date: 2026-03-26
+enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md", "AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content
@@ -61,3 +65,11 @@ EXTRACTION HINT: Focus on the *logic* of precautionary activation (uncertainty t
 - Claude Sonnet 3.7 showed measurable participant uplift on CBRN weapon acquisition tasks compared to standard internet resources
 - Virology Capabilities Test performance had been steadily increasing over time across Claude model generations
 - Anthropic's RSP explicitly permits deployment under a higher standard than confirmed necessary
+
+## Key Facts
+- Claude Opus 4 was deployed with ASL-3 protections in May 2025
+- Claude Sonnet 3.7 showed measurable uplift on CBRN weapon acquisition tasks compared to internet resources
+- Virology Capabilities Test performance increased steadily across Claude model generations
+- ASL-3 protections were scoped to prevent assistance with extended end-to-end CBRN workflows
+- Anthropic's RSP explicitly permits deployment under higher standards than confirmed necessary