extract: 2026-03-26-anthropic-activating-asl3-protections
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
parent 290a0160ae
commit fcd3c793e2
3 changed files with 56 additions and 1 deletion

@@ -129,6 +129,12 @@ METR's methodology (RCT + 143 hours of screen recordings at ~10-second resolution)

METR, the primary producer of governance-relevant capability benchmarks, explicitly acknowledges that its own time-horizon metric (which uses algorithmic scoring) likely overstates operational autonomous capability. The 131-day doubling time for dangerous autonomy may therefore reflect benchmark-performance growth rather than real-world capability growth: the same algorithmic scoring approach that produces 70-75% SWE-Bench success yields 0% production-ready output under holistic evaluation.
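
As a back-of-the-envelope illustration of what that doubling time implies if taken at face value, here is a minimal sketch; the starting horizon and the one-year window are assumed values for the example, not figures from the source:

```python
# Illustrative only: compounds a capability time horizon under the
# 131-day doubling time cited above. The 1-hour starting horizon and
# the one-year window are assumptions for this example.
DOUBLING_TIME_DAYS = 131

def projected_horizon(initial_hours: float, elapsed_days: float) -> float:
    """Horizon after elapsed_days, assuming uninterrupted exponential growth."""
    return initial_hours * 2 ** (elapsed_days / DOUBLING_TIME_DAYS)

# 2 ** (365 / 131) ~= 6.9, so a 1-hour horizon would reach ~6.9 hours in a
# year -- but only if benchmark growth tracks real capability, which is
# exactly what the holistic-evaluation gap above calls into question.
print(round(projected_horizon(1.0, 365), 1))
```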
### Additional Evidence (extend)
*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
Anthropic explicitly acknowledged that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This unreliability is concentrated precisely where governance decisions matter most, creating a structural problem: the governance framework depends on measurements that become less reliable at the decision boundary.

@@ -0,0 +1,37 @@

{
  "rejected_claims": [
    {
      "filename": "precautionary-ai-governance-triggers-higher-protections-when-capability-evaluation-becomes-unreliable.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 7,
    "rejected": 2,
    "fixes_applied": [
      "precautionary-ai-governance-triggers-higher-protections-when-capability-evaluation-becomes-unreliable.md:set_created:2026-03-26",
      "precautionary-ai-governance-triggers-higher-protections-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
      "precautionary-ai-governance-triggers-higher-protections-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:safe-AI-development-requires-building-alignment-mechanisms-b",
      "self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md:set_created:2026-03-26",
      "self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
      "self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
      "self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-"
    ],
    "rejections": [
      "precautionary-ai-governance-triggers-higher-protections-when-capability-evaluation-becomes-unreliable.md:missing_attribution_extractor",
      "self-referential-ai-safety-commitments-lack-independent-verification-creating-accountability-gap.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-26"
}
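
To make the report's shape concrete, here is a minimal sketch of how a downstream step might summarize it; the path `validation_report.json` and the helper name are illustrative assumptions, not artifacts from this commit:

```python
import json

# Sketch: read the validation report above and print a summary.
# "validation_report.json" is an assumed file name for illustration.
def summarize_report(path: str) -> None:
    with open(path) as f:
        report = json.load(f)

    stats = report["validation_stats"]
    print(f"{stats['rejected']}/{stats['total']} claims rejected, "
          f"{stats['fixed']} fixes applied")

    # Each rejected claim carries its filename and the issues that blocked it.
    for claim in report["rejected_claims"]:
        issues = ", ".join(claim["issues"])
        print(f"  {claim['filename']}: {issues}")

summarize_report("validation_report.json")
```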

@@ -7,9 +7,13 @@ date: 2025-05-01

domain: ai-alignment
secondary_domains: []
format: blog
-status: unprocessed
+status: enrichment
priority: high
tags: [ASL-3, precautionary-governance, CBRN, capability-thresholds, RSP, measurement-uncertainty, safety-cases]
+processed_by: theseus
+processed_date: 2026-03-26
+enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content
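
A minimal sketch of how a pipeline stage might read this frontmatter; the note's file name is inferred from the commit title, and the helper itself is an assumption rather than the pipeline's actual code:

```python
import yaml  # PyYAML; assumed to be available in the pipeline environment

def read_status(path: str) -> str:
    """Parse the YAML frontmatter between the leading '---' fences."""
    with open(path) as f:
        text = f.read()
    # split("---", 2) -> ["", frontmatter, body] for a file that opens
    # with a frontmatter block like the one diffed above.
    _, frontmatter, _ = text.split("---", 2)
    meta = yaml.safe_load(frontmatter)
    return meta.get("status", "unprocessed")

# After this commit the status reads "enrichment"; before, "unprocessed".
print(read_status("2026-03-26-anthropic-activating-asl3-protections.md"))
```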

@@ -49,3 +53,11 @@ ASL-3 protections were narrowly scoped: preventing assistance with extended, end-to-end CBRN workflows

PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
WHY ARCHIVED: First documented precautionary capability threshold activation — governance acting before measurement confirmation rather than after
EXTRACTION HINT: Focus on the *logic* of precautionary activation (uncertainty triggers more caution) as the claim, not just the CBRN specifics — the governance principle generalizes
## Key Facts
- Claude Opus 4 was the first Anthropic model that could not be positively confirmed as below ASL-3 thresholds
- ASL-3 protections were narrowly scoped to prevent assistance with extended end-to-end CBRN workflows
- Claude Sonnet 3.7 showed measurable uplift in CBRN weapon acquisition tasks compared to internet resources, though below formal thresholds
- Virology Capabilities Test performance had been steadily increasing over time across Claude model generations
- Anthropic's RSP explicitly permits deployment under higher standards than confirmed necessary