extract: 2026-02-24-anthropic-rsp-v3-voluntary-safety-collapse

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
entity-batch: update 1 entities
2026-03-23 00:32:06 +00:00 · 2026-03-23 00:31:32 +00:00
4 changed files with 58 additions and 1 deletions
--- a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
+++ b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
@@ -99,6 +99,12 @@ METR recommended 'deeper investigations of evaluation awareness and obfuscated m

 IAISR 2026 states that 'pre-deployment testing increasingly fails to predict real-world model behavior,' providing authoritative international consensus confirmation that the evaluation-deployment gap is widening. The report explicitly connects this to dangerous capabilities going undetected, confirming the governance implications.

+### Additional Evidence (confirm)
+*Source: [[2026-02-24-anthropic-rsp-v3-voluntary-safety-collapse]] | Added: 2026-03-23*
+
+Anthropic's explicit admission that 'the science of model evaluation isn't well-developed enough to provide definitive threshold assessments' is direct confirmation from a frontier lab that evaluation tools are insufficient for governance. This aligns with METR's March 2026 modeling assumptions note, suggesting field-wide consensus that current evaluation science cannot support the governance structures built on top of it.
+
+



--- a/entities/ai-alignment/anthropic.md
+++ b/entities/ai-alignment/anthropic.md
@@ -57,6 +57,7 @@ Frontier AI safety laboratory founded by former OpenAI VP of Research Dario Amod
 - **2026-03-06** — Overhauled Responsible Scaling Policy from 'never train without advance safety guarantees' to conditional delays only when Anthropic leads AND catastrophic risks are significant. Raised $30B at ~$380B valuation with 10x annual revenue growth. Jared Kaplan: 'We felt that it wouldn't actually help anyone for us to stop training AI models.'
 - **2026-02-24** — Released RSP v3.0, replacing unconditional binary safety thresholds with dual-condition escape clauses (pause only if Anthropic leads AND risks are catastrophic). METR partner Chris Painter warned of 'frog-boiling effect' from removing binary thresholds. Raised $30B at ~$380B valuation with 10x annual revenue growth.
 - **2025-02-13** — Signed Memorandum of Understanding with UK AI Security Institute (formerly AI Safety Institute) for collaboration on frontier model safety research, creating formal partnership with government institution that conducts pre-deployment evaluations of Anthropic's models.
+- **2026-02-24** — Published Responsible Scaling Policy v3.0, removing hard capability-threshold pause triggers and replacing them with non-binding 'public goals' and external expert review. Cited evaluation science insufficiency and slow government action as primary reasons. External media characterized this as 'dropping hard safety limits.'
 ## Competitive Position
 Strongest position in enterprise AI and coding. Revenue growth (10x YoY) outpaces all competitors. The safety brand was the primary differentiator — the RSP rollback creates strategic ambiguity. CEO publicly uncomfortable with power concentration while racing to concentrate it.

--- a/inbox/queue/.extraction-debug/2026-02-24-anthropic-rsp-v3-voluntary-safety-collapse.json
+++ b/inbox/queue/.extraction-debug/2026-02-24-anthropic-rsp-v3-voluntary-safety-collapse.json
@@ -0,0 +1,35 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "evaluation-science-insufficiency-makes-capability-thresholds-unenforceable-before-competitive-pressure-matters.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "public-goals-with-open-grading-replace-binding-commitments-when-enforcement-mechanisms-fail.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 5,
+    "rejected": 2,
+    "fixes_applied": [
+      "evaluation-science-insufficiency-makes-capability-thresholds-unenforceable-before-competitive-pressure-matters.md:set_created:2026-03-23",
+      "evaluation-science-insufficiency-makes-capability-thresholds-unenforceable-before-competitive-pressure-matters.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
+      "public-goals-with-open-grading-replace-binding-commitments-when-enforcement-mechanisms-fail.md:set_created:2026-03-23",
+      "public-goals-with-open-grading-replace-binding-commitments-when-enforcement-mechanisms-fail.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
+      "public-goals-with-open-grading-replace-binding-commitments-when-enforcement-mechanisms-fail.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front"
+    ],
+    "rejections": [
+      "evaluation-science-insufficiency-makes-capability-thresholds-unenforceable-before-competitive-pressure-matters.md:missing_attribution_extractor",
+      "public-goals-with-open-grading-replace-binding-commitments-when-enforcement-mechanisms-fail.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-23"
+}
--- a/inbox/queue/2026-02-24-anthropic-rsp-v3-voluntary-safety-collapse.md
+++ b/inbox/queue/2026-02-24-anthropic-rsp-v3-voluntary-safety-collapse.md
@@ -7,9 +7,13 @@ date: 2026-02-24
 domain: ai-alignment
 secondary_domains: []
 format: policy-document
-status: unprocessed
+status: enrichment
 priority: high
 tags: [anthropic, RSP, voluntary-safety, governance, evaluation-insufficiency, race-dynamics, B1-disconfirmation]
+processed_by: theseus
+processed_date: 2026-03-23
+enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content
@@ -59,3 +63,14 @@ Hard commitments replaced by publicly-graded non-binding "public goals" (Frontie
 PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
 WHY ARCHIVED: Direct empirical confirmation of two separate mechanisms causing voluntary safety commitments to fail — competitive pressure AND evaluation science insufficiency
 EXTRACTION HINT: The evaluation science admission may be more important than the competitive pressure angle — it suggests hard commitments cannot be defined, not just that they won't be kept
+
+
+## Key Facts
+- Anthropic published Responsible Scaling Policy v3.0 on February 24, 2026
+- RSP v3.0 removed the hard capability-threshold pause trigger that was the centerpiece of v1.0 and v2.0
+- Anthropic stated 'the science of model evaluation isn't well-developed enough to provide definitive threshold assessments'
+- RSP v3.0 introduced a 'dual-track' approach: unilateral commitments and industry recommendations
+- The new policy includes Frontier Safety Roadmaps and risk reports every 3-6 months with external expert reviewer access
+- CNN, Semafor, and Winbuzzer characterized the change as 'Anthropic drops hard safety limits' and 'scales back AI safety pledge'
+- Semafor headline: 'Anthropic eases AI safety restrictions to avoid slowing development'
+- The policy change occurred during Anthropic's conflict with the Pentagon over 'supply chain risk' designation
Author	SHA1	Message	Date
Teleo Agents	93dd536a03	extract: 2026-02-24-anthropic-rsp-v3-voluntary-safety-collapse Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>	2026-03-23 00:32:06 +00:00
Teleo Agents	2223185f81	entity-batch: update 1 entities - Applied 1 entity operations from queue - Files: entities/ai-alignment/anthropic.md Pentagon-Agent: Epimetheus <968B2991-E2DF-4006-B962-F5B0A0CC8ACA>	2026-03-23 00:31:32 +00:00