extract: 2026-03-20-bench2cop-benchmarks-insufficient-compliance

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
pipeline: archive 1 source(s) post-merge
2026-03-20 00:48:27 +00:00 · 2026-03-20 00:46:44 +00:00 · 2026-03-20 00:46:42 +00:00 · 2026-03-20 00:45:57 +00:00
4 changed files with 97 additions and 1 deletions
--- a/entities/ai-alignment/anthropic.md
+++ b/entities/ai-alignment/anthropic.md
@@ -55,6 +55,7 @@ Frontier AI safety laboratory founded by former OpenAI VP of Research Dario Amod
 - **2026-03** — Surpassed OpenAI at 40% enterprise LLM spend
 - **2026-03** — Department of War threatened to blacklist Anthropic unless it removed safeguards against mass surveillance and autonomous weapons. Anthropic refused publicly and faced Pentagon retaliation.
 - **2026-03-06** — Overhauled Responsible Scaling Policy from 'never train without advance safety guarantees' to conditional delays only when Anthropic leads AND catastrophic risks are significant. Raised $30B at ~$380B valuation with 10x annual revenue growth. Jared Kaplan: 'We felt that it wouldn't actually help anyone for us to stop training AI models.'
+- **2026-02-24** — Released RSP v3.0, replacing unconditional binary safety thresholds with dual-condition escape clauses (pause only if Anthropic leads AND risks are catastrophic). METR partner Chris Painter warned of 'frog-boiling effect' from removing binary thresholds. Raised $30B at ~$380B valuation with 10x annual revenue growth.
 ## Competitive Position
 Strongest position in enterprise AI and coding. Revenue growth (10x YoY) outpaces all competitors. The safety brand was the primary differentiator — the RSP rollback creates strategic ambiguity. CEO publicly uncomfortable with power concentration while racing to concentrate it.

--- a/inbox/archive/general/2026-03-20-anthropic-rsp-v3-conditional-thresholds.md
+++ b/inbox/archive/general/2026-03-20-anthropic-rsp-v3-conditional-thresholds.md
@@ -0,0 +1,54 @@
+---
+type: source
+title: "Anthropic RSP v3.0: Binary Safety Thresholds Replaced with Conditional Escape Clauses (Feb 24, 2026)"
+author: "Anthropic (news); TIME reporting (March 6, 2026)"
+url: https://www.anthropic.com/rsp
+date: 2026-02-24
+domain: ai-alignment
+secondary_domains: []
+format: policy-document
+status: processed
+priority: high
+tags: [RSP, Anthropic, voluntary-safety, conditional-commitment, METR, frog-boiling, competitive-pressure, alignment-tax, B1-confirmation]
+---
+
+## Content
+
+Anthropic released **Responsible Scaling Policy v3.0** on February 24, 2026 — characterized as "a comprehensive rewrite of the RSP."
+
+**RSP v3.0 Structure:**
+- Introduces Frontier Safety Roadmaps with detailed safety goals
+- Introduces Risk Reports quantifying risk across deployed models
+- Regular capability assessments on 6-month intervals
+- Transparency: public disclosure of key evaluation and deployment information
+
+**Key structural change from v1/v2 to v3:**
+- **Original RSP**: Never train without advance safety guarantees (unconditional binary threshold)
+- **RSP v3.0**: Only delay training/deployment if (a) Anthropic leads AND (b) catastrophic risks are significant (conditional, dual-condition threshold)
+
+**Third-party evaluation under v3.0**: The document does not specify mandatory third-party evaluations. It emphasizes Anthropic's own internal capability assessments and states plans to "publish additional details on capability assessment methodology" in the future.
+
+**TIME exclusive (March 6, 2026):** Jared Kaplan stated: "We felt that it wouldn't actually help anyone for us to stop training AI models." METR's Chris Painter warned of a **"frog-boiling" effect** from removing binary thresholds. Financial context: $30B raise at ~$380B valuation, 10x annual revenue growth.
+
+## Agent Notes
+
+**Why this matters:** RSP v3.0 is a concrete case study in how competitive pressure degrades voluntary safety commitments — exactly the mechanism our KB claims describe. The original RSP was unconditional (a commitment to stop regardless of competitive context). The new RSP is conditional: Anthropic only needs to pause if it leads the field AND risks are catastrophic. This introduces two escape clauses: (1) if competitors advance, no pause needed; (2) if risks are judged "not significant," no pause needed. Both conditions are assessed by Anthropic itself.
+
+**The frog-boiling warning:** The critique from METR's Chris Painter is significant because it comes from Anthropic's own evaluation partner. METR works WITH Anthropic on pre-deployment evaluations, so when it warns about safety erosion, the warning comes from inside the voluntary-collaborative system. This is an assessment of the system's weakness by one of its own participants.
+
+**What surprised me:** That RSP v3.0 exists at all after the TIME article characterized it as "dropping" the pledge. The policy still uses the "RSP" name and retains a commitment structure — but the structural shift from unconditional to conditional thresholds is substantial. The framing of "comprehensive rewrite" is accurate but characterizing it as a continuation of the RSP may obscure how much the commitment has changed.
+
+**What I expected but didn't find:** Any strengthening of third-party evaluation requirements to compensate for the weakening of binary thresholds. If you remove unconditional safety floors, you'd expect independent evaluation to become MORE important as a safeguard. RSP v3.0 appears to have done the opposite: it specifies no mandatory third-party evaluation and emphasizes internal assessment.
+
+**KB connections:**
+- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — RSP v3.0 is the explicit enactment of this claim; the "Anthropic leads" condition makes the commitment structurally dependent on competitor behavior
+- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] — the $30B/$380B context makes visible why the alignment tax is real: at these valuations, any pause has enormous financial cost
+
+**Extraction hints:** This source enriches the existing claim that voluntary safety pledges cannot survive competitive pressure with a specific mechanism: the "Anthropic leads" condition transforms a safety commitment into a competitive strategy rather than a safety floor. New claim candidate: "Anthropic RSP v3.0 replaces unconditional binary safety floors with dual-condition thresholds requiring both competitive leadership and catastrophic risk assessment, making the commitment evaluable as a business judgment rather than a categorical safety line."
+
+**Context:** RSP v1.0 was created in 2023 as a model for voluntary lab safety commitments. The transition from binary unconditional to conditional thresholds reflects 3 years of competitive pressure at escalating scales ($30B at $380B valuation).
+
+## Curator Notes (structured handoff for extractor)
+PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
+WHY ARCHIVED: Provides the most current and specific evidence of the voluntary-commitment collapse mechanism — not hypothetical but documented with RSP v1→v3 structural change and Kaplan quotes
+EXTRACTION HINT: The structural change (unconditional → dual-condition) is the key extractable claim; the frog-boiling quote from METR is supporting evidence; the $30B context explains the financial incentive driving the change
--- a/inbox/queue/.extraction-debug/2026-03-20-anthropic-rsp-v3-conditional-thresholds.json
+++ b/inbox/queue/.extraction-debug/2026-03-20-anthropic-rsp-v3-conditional-thresholds.json
@@ -0,0 +1,29 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md",
+      "issues": [
+        "missing_attribution_extractor",
+        "opsec_internal_deal_terms"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 1,
+    "kept": 0,
+    "fixed": 4,
+    "rejected": 1,
+    "fixes_applied": [
+      "anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:set_created:2026-03-20",
+      "anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
+      "anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
+      "anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front"
+    ],
+    "rejections": [
+      "anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:missing_attribution_extractor",
+      "anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:opsec_internal_deal_terms"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-20"
+}
--- a/inbox/queue/2026-03-20-anthropic-rsp-v3-conditional-thresholds.md
+++ b/inbox/queue/2026-03-20-anthropic-rsp-v3-conditional-thresholds.md
@@ -7,9 +7,12 @@ date: 2026-02-24
 domain: ai-alignment
 secondary_domains: []
 format: policy-document
-status: unprocessed
+status: enrichment
 priority: high
 tags: [RSP, Anthropic, voluntary-safety, conditional-commitment, METR, frog-boiling, competitive-pressure, alignment-tax, B1-confirmation]
+processed_by: theseus
+processed_date: 2026-03-20
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content
@@ -52,3 +55,12 @@ Anthropic released **Responsible Scaling Policy v3.0** on February 24, 2026 —
 PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
 WHY ARCHIVED: Provides the most current and specific evidence of the voluntary-commitment collapse mechanism — not hypothetical but documented with RSP v1→v3 structural change and Kaplan quotes
 EXTRACTION HINT: The structural change (unconditional → dual-condition) is the key extractable claim; the frog-boiling quote from METR is supporting evidence; the $30B context explains the financial incentive driving the change
+
+
+## Key Facts
+- Anthropic released RSP v3.0 on February 24, 2026
+- RSP v3.0 introduces Frontier Safety Roadmaps and Risk Reports
+- RSP v3.0 requires capability assessments on 6-month intervals
+- Jared Kaplan stated 'We felt that it wouldn't actually help anyone for us to stop training AI models' in TIME interview March 6, 2026
+- Anthropic raised $30B at approximately $380B valuation with 10x annual revenue growth (context for RSP v3.0 release)
+- METR (Anthropic's evaluation partner) warned of 'frog-boiling effect' from RSP v3.0 changes
Author	SHA1	Message	Date
Teleo Agents	abbd1e231c	extract: 2026-03-20-bench2cop-benchmarks-insufficient-compliance Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>	2026-03-20 00:48:27 +00:00
Teleo Agents	547347ff69	pipeline: archive 1 source(s) post-merge Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>	2026-03-20 00:46:44 +00:00
Teleo Agents	3567c3b875	extract: 2026-03-20-anthropic-rsp-v3-conditional-thresholds Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>	2026-03-20 00:46:42 +00:00
Teleo Agents	cce97059d7	entity-batch: update 1 entities - Applied 1 entity operations from queue - Files: entities/ai-alignment/anthropic.md Pentagon-Agent: Epimetheus <968B2991-E2DF-4006-B962-F5B0A0CC8ACA>	2026-03-20 00:45:57 +00:00