From 108a0d631c6e7daaae68193b95ec06a392af421e Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Thu, 26 Mar 2026 03:02:27 +0000
Subject: [PATCH] extract: 2026-03-26-anthropic-activating-asl3-protections

---
 ...t proximate AI-enabled existential risk.md |  6 +++++
 ...ernance-built-on-unreliable-foundations.md |  6 +++++
 ... advance without equivalent constraints.md |  6 +++++
 ...anthropic-activating-asl3-protections.json | 25 ++++++++++---------
 ...6-anthropic-activating-asl3-protections.md | 12 +++++++++
 5 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/domains/ai-alignment/AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md b/domains/ai-alignment/AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md
index 7ab0b2e4..778c86e7 100644
--- a/domains/ai-alignment/AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md
+++ b/domains/ai-alignment/AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md
@@ -40,6 +40,12 @@ STREAM framework proposes standardized ChemBio evaluation reporting with 23-expe
 AISLE's autonomous discovery of 12 OpenSSL CVEs including a 30-year-old bug demonstrates that AI also lowers the expertise barrier for offensive cyber from specialized security researcher to automated system. Unlike bioweapons, zero-day discovery is also a defensive capability, but the dual-use nature means the same autonomous system that defends can be redirected offensively. The fact that this capability is already deployed commercially while governance frameworks haven't incorporated it suggests the expertise-barrier-lowering dynamic extends beyond bio to cyber domains.
 
+### Additional Evidence (confirm)
+*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
+
+Anthropic's decision to activate ASL-3 protections was driven by evidence that participants using Claude Sonnet 3.7 performed 'measurably better' on CBRN weapon acquisition tasks than those using standard internet resources, and that Virology Capabilities Test performance had been 'steadily increasing over time' across Claude model generations. This provides empirical confirmation that the expertise barrier is lowering in practice, not just in theory, and that the trend is consistent enough to justify precautionary governance action.
+
+
 Relevant Notes:
 - [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — Amodei's admission of Claude exhibiting deception and subversion during testing is a concrete instance of this pattern, with bioweapon implications

diff --git a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
index da6a1e34..e0b33dde 100644
--- a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
+++ b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
@@ -154,6 +154,12 @@ METR's August 2025 research update provides specific quantification of the evalu
 Anthropic explicitly acknowledged that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This is a frontier lab publicly stating that evaluation reliability degrades precisely when it matters most—near capability thresholds. The ASL-3 activation was triggered by this evaluation uncertainty rather than confirmed capability, suggesting governance frameworks are adapting to evaluation unreliability rather than solving it.
 
+### Additional Evidence (extend)
+*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
+
+Anthropic's ASL-3 activation explicitly acknowledges that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This is the first public admission from a frontier lab that evaluation reliability degrades near capability thresholds, creating a zone where governance must operate under irreducible uncertainty. The activation proceeded even though Anthropic could not 'clearly rule out ASL-3 risks' the way it could for previous models, demonstrating that the evaluation limitation is not theoretical but operationally binding.
+
+
diff --git a/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md b/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md
index 4e62f756..11d1a6b7 100644
--- a/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md
+++ b/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md
@@ -68,6 +68,12 @@ The coordination gap provides the mechanism explaining why voluntary commitments
 RepliBench exists as a comprehensive self-replication evaluation tool but is not integrated into compliance frameworks despite EU AI Act Article 55 taking effect after its publication.
Labs can voluntarily use it but face no enforcement mechanism requiring them to do so, creating competitive pressure to avoid evaluations that might reveal concerning capabilities.
 
+### Additional Evidence (challenge)
+*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
+
+Anthropic maintained its ASL-3 commitment through precautionary activation despite commercial pressure to deploy Claude Opus 4 without additional constraints. This is a counterexample to the claim that voluntary commitments inevitably collapse under competition. However, the commitment was maintained through a narrow scoping of protections (only 'extended, end-to-end CBRN workflows'), and the activation occurred in May 2025, before the RSP v3.0 rollback documented in February 2026. The temporal sequence suggests the commitment held temporarily, but the competitive cost of sustaining it may have contributed to the pressure that later forced the RSP weakening.
+
+
diff --git a/inbox/queue/.extraction-debug/2026-03-26-anthropic-activating-asl3-protections.json b/inbox/queue/.extraction-debug/2026-03-26-anthropic-activating-asl3-protections.json
index 20188d9b..6d345354 100644
--- a/inbox/queue/.extraction-debug/2026-03-26-anthropic-activating-asl3-protections.json
+++ b/inbox/queue/.extraction-debug/2026-03-26-anthropic-activating-asl3-protections.json
@@ -1,13 +1,13 @@
 {
   "rejected_claims": [
     {
-      "filename": "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md",
+      "filename": "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md",
       "issues": [
         "missing_attribution_extractor"
       ]
     },
     {
-      "filename": "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md",
+      "filename": "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md",
       "issues": [
         "missing_attribution_extractor"
       ]
@@ -16,20 +16,21 @@
   "validation_stats": {
     "total": 2,
     "kept": 0,
-    "fixed": 7,
+    "fixed": 8,
     "rejected": 2,
     "fixes_applied": [
-      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:set_created:2026-03-26",
-      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
-      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:safe-AI-development-requires-building-alignment-mechanisms-b",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:set_created:2026-03-26",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-"
+
"precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:set_created:2026-03-26", + "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:stripped_wiki_link:pre-deployment-AI-evaluations-do-not-predict-real-world-risk", + "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:stripped_wiki_link:voluntary safety pledges cannot survive competitive pressure", + "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:stripped_wiki_link:safe AI development requires building alignment mechanisms b", + "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:set_created:2026-03-26", + "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:AI transparency is declining not improving because Stanford ", + "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:only binding regulation with enforcement teeth changes front", + "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:voluntary safety pledges cannot survive competitive pressure" ], "rejections": [ - "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:missing_attribution_extractor", - "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:missing_attribution_extractor" + "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:missing_attribution_extractor", + "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:missing_attribution_extractor" ] }, "model": "anthropic/claude-sonnet-4.5", diff --git a/inbox/queue/2026-03-26-anthropic-activating-asl3-protections.md b/inbox/queue/2026-03-26-anthropic-activating-asl3-protections.md index 080040b3..3bc2c9c5 100644 --- a/inbox/queue/2026-03-26-anthropic-activating-asl3-protections.md +++ b/inbox/queue/2026-03-26-anthropic-activating-asl3-protections.md @@ -14,6 +14,10 @@ processed_by: theseus processed_date: 2026-03-26 enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"] extraction_model: "anthropic/claude-sonnet-4.5" +processed_by: theseus +processed_date: 2026-03-26 +enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md", "AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md"] +extraction_model: "anthropic/claude-sonnet-4.5" --- ## Content @@ -61,3 +65,11 @@ EXTRACTION HINT: Focus 
on the *logic* of precautionary activation (uncertainty t - Claude Sonnet 3.7 showed measurable participant uplift on CBRN weapon acquisition tasks compared to standard internet resources - Virology Capabilities Test performance had been steadily increasing over time across Claude model generations - Anthropic's RSP explicitly permits deployment under a higher standard than confirmed necessary + + +## Key Facts +- Claude Opus 4 was deployed with ASL-3 protections in May 2025 +- Claude Sonnet 3.7 showed measurable uplift on CBRN weapon acquisition tasks compared to internet resources +- Virology Capabilities Test performance increased steadily across Claude model generations +- ASL-3 protections were scoped to prevent assistance with extended end-to-end CBRN workflows +- Anthropic's RSP explicitly permits deployment under higher standards than confirmed necessary