extract: 2026-03-26-anthropic-activating-asl3-protections

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
Teleo Agents 2026-03-26 01:00:52 +00:00
parent ec2cfc2e63
commit 3920f4be46
3 changed files with 30 additions and 11 deletions

@@ -144,6 +144,12 @@ METR's August 2025 research update provides specific quantification of the evalu
 Anthropic explicitly acknowledged that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This is a frontier lab publicly stating that evaluation reliability degrades precisely when it matters most—near capability thresholds. The ASL-3 activation was triggered by this evaluation uncertainty rather than confirmed capability, suggesting governance frameworks are adapting to evaluation unreliability rather than solving it.
+### Additional Evidence (extend)
+*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
+Anthropic's ASL-3 activation demonstrates that evaluation uncertainty compounds near capability thresholds: 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' The Virology Capabilities Test showed 'steadily increasing' performance across model generations, but Anthropic could not definitively confirm whether Opus 4 crossed the threshold—they activated protections based on trend trajectory and inability to rule out crossing rather than confirmed measurement.

@@ -1,13 +1,13 @@
 {
   "rejected_claims": [
     {
-      "filename": "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md",
+      "filename": "precautionary-ai-governance-activates-safeguards-when-capability-evaluation-uncertainty-itself-triggers-escalation.md",
       "issues": [
         "missing_attribution_extractor"
       ]
     },
     {
-      "filename": "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md",
+      "filename": "ai-safety-governance-lacks-external-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md",
       "issues": [
         "missing_attribution_extractor"
       ]
@@ -19,17 +19,17 @@
   "fixed": 7,
   "rejected": 2,
   "fixes_applied": [
-    "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:set_created:2026-03-26",
-    "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
-    "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:safe-AI-development-requires-building-alignment-mechanisms-b",
-    "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:set_created:2026-03-26",
-    "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
-    "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
-    "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-"
+    "precautionary-ai-governance-activates-safeguards-when-capability-evaluation-uncertainty-itself-triggers-escalation.md:set_created:2026-03-26",
+    "precautionary-ai-governance-activates-safeguards-when-capability-evaluation-uncertainty-itself-triggers-escalation.md:stripped_wiki_link:pre-deployment-AI-evaluations-do-not-predict-real-world-risk",
+    "precautionary-ai-governance-activates-safeguards-when-capability-evaluation-uncertainty-itself-triggers-escalation.md:stripped_wiki_link:safe-AI-development-requires-building-alignment-mechanisms-b",
+    "ai-safety-governance-lacks-external-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:set_created:2026-03-26",
+    "ai-safety-governance-lacks-external-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
+    "ai-safety-governance-lacks-external-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-",
+    "ai-safety-governance-lacks-external-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front"
   ],
   "rejections": [
-    "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:missing_attribution_extractor",
-    "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:missing_attribution_extractor"
+    "precautionary-ai-governance-activates-safeguards-when-capability-evaluation-uncertainty-itself-triggers-escalation.md:missing_attribution_extractor",
+    "ai-safety-governance-lacks-external-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:missing_attribution_extractor"
   ]
 },
 "model": "anthropic/claude-sonnet-4.5",
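The `fixes_applied` and `rejections` arrays above pack each repair into a single colon-delimited string, `filename:action:detail`. A minimal sketch of how a consumer of this log might split those strings back into per-file records — the `group_fixes` helper and sample entries are illustrative assumptions, not part of this repo, and it assumes filenames never contain colons:

```python
# Illustrative helper (not from this repo): group "filename:action:detail"
# log strings by filename. Assumes filenames contain no colons.
from collections import defaultdict

def group_fixes(entries):
    grouped = defaultdict(list)
    for entry in entries:
        # maxsplit=2 keeps any colons inside the detail field intact
        filename, action, detail = entry.split(":", 2)
        grouped[filename].append((action, detail))
    return dict(grouped)

fixes = [
    "example-claim.md:set_created:2026-03-26",
    "example-claim.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
]
print(group_fixes(fixes))
```

Splitting with `maxsplit=2` rather than a plain split is what lets the `detail` field carry arbitrary text.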

@@ -14,6 +14,10 @@ processed_by: theseus
 processed_date: 2026-03-26
 enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
 extraction_model: "anthropic/claude-sonnet-4.5"
+processed_by: theseus
+processed_date: 2026-03-26
+enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 ## Content
@@ -61,3 +65,12 @@ EXTRACTION HINT: Focus on the *logic* of precautionary activation (uncertainty t
 - Claude Sonnet 3.7 showed measurable participant uplift on CBRN weapon acquisition tasks compared to standard internet resources
 - Virology Capabilities Test performance had been steadily increasing over time across Claude model generations
 - Anthropic's RSP explicitly permits deployment under a higher standard than confirmed necessary
+## Key Facts
+- Claude Opus 4 was released in May 2025
+- ASL-3 protections were narrowly scoped to prevent assistance with extended end-to-end CBRN workflows
+- Claude Sonnet 3.7 showed measurable participant uplift on CBRN weapon acquisition tasks compared to standard internet resources
+- Virology Capabilities Test performance had been steadily increasing across Claude model generations
+- Anthropic's RSP explicitly permits deployment under a higher standard than confirmed necessary
+- This was the first Claude model that could not be positively confirmed as below ASL-3 thresholds