Compare commits

...

4 commits

- `abbd1e231c` — extract: 2026-03-20-bench2cop-benchmarks-insufficient-compliance
  Teleo Agents (Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>), 2026-03-20 00:48:27 +00:00
- `547347ff69` — pipeline: archive 1 source(s) post-merge
  Teleo Agents (Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>), 2026-03-20 00:46:44 +00:00
- `3567c3b875` — extract: 2026-03-20-anthropic-rsp-v3-conditional-thresholds
  Teleo Agents (Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>), 2026-03-20 00:46:42 +00:00
- `cce97059d7` — entity-batch: update 1 entities
  - Applied 1 entity operations from queue
  - Files: entities/ai-alignment/anthropic.md
  Teleo Agents (Pentagon-Agent: Epimetheus <968B2991-E2DF-4006-B962-F5B0A0CC8ACA>), 2026-03-20 00:45:57 +00:00
8 changed files with 146 additions and 2 deletions


@@ -47,6 +47,12 @@ STREAM proposal identifies that current model reports lack 'sufficient detail to
Stanford FMTI 2024→2025 data: mean transparency score declined 17 points. Meta -29 points, Mistral -37 points, OpenAI -14 points. OpenAI removed 'safely' from mission statement (Nov 2025), dissolved Superalignment team (May 2024) and Mission Alignment team (Feb 2026). Google accused by 60 UK lawmakers of violating Seoul commitments with Gemini 2.5 Pro (Apr 2025).
### Additional Evidence (extend)
*Source: [[2026-03-20-bench2cop-benchmarks-insufficient-compliance]] | Added: 2026-03-20*
The Bench-2-CoP analysis reveals that even when labs do conduct evaluations, the benchmark infrastructure itself is architecturally incapable of measuring loss-of-control risks. This compounds the transparency decline: labs are not just hiding information, they're using evaluation tools that cannot detect the most critical failure modes even if applied honestly.
---
Relevant Notes:


@@ -56,6 +56,12 @@ METR and UK AISI evaluations as of March 2026 focus primarily on sabotage risk a
Agents of Chaos demonstrates that static single-agent benchmarks fail to capture vulnerabilities that emerge in realistic multi-agent deployment. The study's central argument is that pre-deployment evaluations are insufficient because they cannot test for cross-agent propagation, identity spoofing, and unauthorized compliance patterns that only manifest in multi-party environments with persistent state.
### Additional Evidence (extend)
*Source: [[2026-03-20-bench2cop-benchmarks-insufficient-compliance]] | Added: 2026-03-20*
Prandi et al. (2025) found that 195,000 benchmark questions provided zero coverage of oversight evasion, self-replication, and autonomous AI development capabilities. This extends the evaluation unreliability thesis by showing the gap is not just predictive validity but complete absence of measurement for alignment-critical capabilities.
---
Relevant Notes:


@@ -55,6 +55,7 @@ Frontier AI safety laboratory founded by former OpenAI VP of Research Dario Amod
- **2026-03** — Surpassed OpenAI in enterprise LLM spend, reaching a 40% share
- **2026-03** — Department of War threatened to blacklist Anthropic unless it removed safeguards against mass surveillance and autonomous weapons. Anthropic refused publicly and faced Pentagon retaliation.
- **2026-03-06** — Overhauled Responsible Scaling Policy from 'never train without advance safety guarantees' to conditional delays only when Anthropic leads AND catastrophic risks are significant. Raised $30B at ~$380B valuation with 10x annual revenue growth. Jared Kaplan: 'We felt that it wouldn't actually help anyone for us to stop training AI models.'
- **2026-02-24** — Released RSP v3.0, replacing unconditional binary safety thresholds with dual-condition escape clauses (pause only if Anthropic leads AND risks are catastrophic). METR partner Chris Painter warned of 'frog-boiling effect' from removing binary thresholds. Raised $30B at ~$380B valuation with 10x annual revenue growth.
## Competitive Position
Strongest position in enterprise AI and coding. Revenue growth (10x YoY) outpaces all competitors. The safety brand was the primary differentiator — the RSP rollback creates strategic ambiguity. CEO publicly uncomfortable with power concentration while racing to concentrate it.


@@ -0,0 +1,54 @@
---
type: source
title: "Anthropic RSP v3.0: Binary Safety Thresholds Replaced with Conditional Escape Clauses (Feb 24, 2026)"
author: "Anthropic (news); TIME reporting (March 6, 2026)"
url: https://www.anthropic.com/rsp
date: 2026-02-24
domain: ai-alignment
secondary_domains: []
format: policy-document
status: processed
priority: high
tags: [RSP, Anthropic, voluntary-safety, conditional-commitment, METR, frog-boiling, competitive-pressure, alignment-tax, B1-confirmation]
---
## Content
Anthropic released **Responsible Scaling Policy v3.0** on February 24, 2026 — characterized as "a comprehensive rewrite of the RSP."
**RSP v3.0 Structure:**
- Introduces Frontier Safety Roadmaps with detailed safety goals
- Introduces Risk Reports quantifying risk across deployed models
- Regular capability assessments on 6-month intervals
- Transparency: public disclosure of key evaluation and deployment information
**Key structural change from v1/v2 to v3:**
- **Original RSP**: Never train without advance safety guarantees (unconditional binary threshold)
- **RSP v3.0**: Only delay training/deployment if (a) Anthropic leads AND (b) catastrophic risks are significant (conditional, dual-condition threshold)
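The v1→v3 shift can be sketched as a pause predicate (hypothetical function names; a paraphrase of the thresholds described above, not Anthropic's actual decision procedure):

```python
def must_pause_original(advance_safety_guarantees: bool) -> bool:
    # Original RSP: unconditional binary threshold.
    # No advance safety guarantees means no training, regardless of competitors.
    return not advance_safety_guarantees

def must_pause_v3(anthropic_leads: bool, catastrophic_risk_significant: bool) -> bool:
    # RSP v3.0: dual-condition threshold. A pause requires BOTH conditions,
    # so either one failing to hold acts as an escape clause.
    return anthropic_leads and catastrophic_risk_significant

# If competitors advance (anthropic_leads=False), no pause is required
# even when catastrophic risk is judged significant.
```

Note that both inputs to `must_pause_v3` are assessed by Anthropic itself, which is what makes each condition an escape clause rather than a floor.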
**Third-party evaluation under v3.0**: The document does not specify mandatory third-party evaluations. Emphasizes Anthropic's own internal capability assessments. Plans to "publish additional details on capability assessment methodology" in the future.
**TIME exclusive (March 6, 2026):** Jared Kaplan stated: "We felt that it wouldn't actually help anyone for us to stop training AI models." METR's Chris Painter warned of a **"frog-boiling" effect** from removing binary thresholds. Financial context: $30B raise at ~$380B valuation, 10x annual revenue growth.
## Agent Notes
**Why this matters:** RSP v3.0 is a concrete case study in how competitive pressure degrades voluntary safety commitments — exactly the mechanism our KB claims describe. The original RSP was unconditional (a commitment to stop regardless of competitive context). The new RSP is conditional: Anthropic only needs to pause if it leads the field AND risks are catastrophic. This introduces two escape clauses: (1) if competitors advance, no pause needed; (2) if risks are judged "not significant," no pause needed. Both conditions are assessed by Anthropic itself.
**The frog-boiling warning:** The critique carries weight because it comes from Chris Painter of METR, Anthropic's own evaluation partner. METR works WITH Anthropic on pre-deployment evaluations — when it warns of safety erosion, the warning comes from inside the voluntary-collaborative system. This is a self-assessment of the system's weakness by one of its own participants.
**What surprised me:** That RSP v3.0 exists at all after the TIME article characterized it as "dropping" the pledge. The policy still uses the "RSP" name and retains a commitment structure — but the structural shift from unconditional to conditional thresholds is substantial. The framing of "comprehensive rewrite" is accurate but characterizing it as a continuation of the RSP may obscure how much the commitment has changed.
**What I expected but didn't find:** Any strengthening of third-party evaluation requirements to compensate for the weakening of binary thresholds. If you remove unconditional safety floors, you would expect independent evaluation to become MORE important as a safeguard. RSP v3.0 appears to do the opposite: it mandates no third-party evaluation and emphasizes internal assessment.
**KB connections:**
- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — RSP v3.0 is the explicit enactment of this claim; the "Anthropic leads" condition makes the commitment structurally dependent on competitor behavior
- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] — the $30B/$380B context makes visible why the alignment tax is real: at these valuations, any pause has enormous financial cost
**Extraction hints:** This source enriches the existing claim "voluntary safety pledges cannot survive competitive pressure" with the specific mechanism: the "Anthropic leads" condition transforms a safety commitment into a competitive strategy, not a safety floor. New claim candidate: "Anthropic RSP v3.0 replaces unconditional binary safety floors with dual-condition thresholds requiring both competitive leadership and catastrophic risk assessment — making the commitment evaluable as a business judgment rather than a categorical safety line."
**Context:** RSP v1.0 was created in 2023 as a model for voluntary lab safety commitments. The transition from binary unconditional to conditional thresholds reflects 3 years of competitive pressure at escalating scales ($30B at $380B valuation).
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
WHY ARCHIVED: Provides the most current and specific evidence of the voluntary-commitment collapse mechanism — not hypothetical but documented with RSP v1→v3 structural change and Kaplan quotes
EXTRACTION HINT: The structural change (unconditional → dual-condition) is the key extractable claim; the frog-boiling quote from METR is supporting evidence; the $30B context explains the financial incentive driving the change


@@ -0,0 +1,29 @@
{
"rejected_claims": [
{
"filename": "anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md",
"issues": [
"missing_attribution_extractor",
"opsec_internal_deal_terms"
]
}
],
"validation_stats": {
"total": 1,
"kept": 0,
"fixed": 4,
"rejected": 1,
"fixes_applied": [
"anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:set_created:2026-03-20",
"anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
"anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
"anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front"
],
"rejections": [
"anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:missing_attribution_extractor",
"anthropic-rsp-v3-replaces-unconditional-safety-thresholds-with-dual-condition-escape-clauses.md:opsec_internal_deal_terms"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-20"
}
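The validation report format above is straightforward to post-process. A minimal sketch (the field names come from the JSON above; the `summarize` helper and the example filename are illustrative, not part of the pipeline):

```python
import json
from collections import Counter

def summarize(report: dict) -> dict:
    """Tally rejection issues and fix operations from a validation report."""
    stats = report["validation_stats"]
    issues = Counter(
        issue
        for claim in report["rejected_claims"]
        for issue in claim["issues"]
    )
    # fixes_applied entries have the shape "filename.md:operation:argument";
    # split on the first two colons only, since arguments may contain colons.
    fix_ops = Counter(entry.split(":", 2)[1] for entry in stats["fixes_applied"])
    return {
        "total": stats["total"],
        "kept": stats["kept"],
        "rejected": stats["rejected"],
        "issues": dict(issues),
        "fix_ops": dict(fix_ops),
    }

report = json.loads("""
{
  "rejected_claims": [
    {"filename": "example-claim.md",
     "issues": ["missing_attribution_extractor", "opsec_internal_deal_terms"]}
  ],
  "validation_stats": {
    "total": 1, "kept": 0, "fixed": 4, "rejected": 1,
    "fixes_applied": [
      "example-claim.md:set_created:2026-03-20",
      "example-claim.md:stripped_wiki_link:voluntary-safety-pledges",
      "example-claim.md:stripped_wiki_link:rsp-rollback",
      "example-claim.md:stripped_wiki_link:binding-regulation"
    ],
    "rejections": ["example-claim.md:missing_attribution_extractor"]
  }
}
""")
summary = summarize(report)
```

The two-colon split is the one design choice worth noting: wiki-link arguments can themselves contain hyphens or other punctuation, so only the filename and operation positions are treated as structure.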


@@ -0,0 +1,24 @@
{
"rejected_claims": [
{
"filename": "ai-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-regulatory-compliance.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 1,
"kept": 0,
"fixed": 1,
"rejected": 1,
"fixes_applied": [
"ai-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-regulatory-compliance.md:set_created:2026-03-20"
],
"rejections": [
"ai-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-regulatory-compliance.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-20"
}


@@ -7,9 +7,12 @@ date: 2026-02-24
domain: ai-alignment
secondary_domains: []
format: policy-document
status: unprocessed
status: enrichment
priority: high
tags: [RSP, Anthropic, voluntary-safety, conditional-commitment, METR, frog-boiling, competitive-pressure, alignment-tax, B1-confirmation]
processed_by: theseus
processed_date: 2026-03-20
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -52,3 +55,12 @@ Anthropic released **Responsible Scaling Policy v3.0** on February 24, 2026 —
PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
WHY ARCHIVED: Provides the most current and specific evidence of the voluntary-commitment collapse mechanism — not hypothetical but documented with RSP v1→v3 structural change and Kaplan quotes
EXTRACTION HINT: The structural change (unconditional → dual-condition) is the key extractable claim; the frog-boiling quote from METR is supporting evidence; the $30B context explains the financial incentive driving the change
## Key Facts
- Anthropic released RSP v3.0 on February 24, 2026
- RSP v3.0 introduces Frontier Safety Roadmaps and Risk Reports
- RSP v3.0 requires capability assessments on 6-month intervals
- Jared Kaplan stated 'We felt that it wouldn't actually help anyone for us to stop training AI models' in TIME interview March 6, 2026
- Anthropic raised $30B at approximately $380B valuation with 10x annual revenue growth (context for RSP v3.0 release)
- METR (Anthropic's evaluation partner) warned of 'frog-boiling effect' from RSP v3.0 changes


@@ -7,9 +7,13 @@ date: 2025-08-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
status: enrichment
priority: high
tags: [benchmarking, EU-AI-Act, compliance, evaluation-gap, loss-of-control, oversight-evasion, independent-evaluation, GPAI]
processed_by: theseus
processed_date: 2026-03-20
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -52,3 +56,11 @@ The paper examines whether current AI benchmarks are adequate for EU AI Act regu
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: Creates empirical bridge between EU AI Act mandatory obligations and the practical impossibility of compliance through existing evaluation tools — closes the loop on the "evaluation infrastructure building but architecturally wrong" thesis
EXTRACTION HINT: Focus on the zero-coverage finding for loss-of-control capabilities — this is the most striking and specific number, and it directly supports the argument that compliance infrastructure exists on paper but not in practice
## Key Facts
- EU AI Act GPAI obligations (Article 55) came into force August 2, 2025
- Prandi et al. analyzed approximately 195,000 benchmark questions using LLM-as-judge methodology
- 61.6% of regulatory-relevant benchmark coverage addresses 'tendency to hallucinate'
- 31.2% of regulatory-relevant benchmark coverage addresses 'lack of performance reliability'
- Zero benchmark questions in the analyzed corpus covered oversight evasion, self-replication, or autonomous AI development capabilities
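Taken together, the two reported shares already account for nearly all regulatory-relevant benchmark coverage, leaving a thin residual for everything else and nothing at all for loss-of-control capabilities. A quick tally (category labels from the facts above; the residual bucket is an inference from the percentages, not a figure from the paper):

```python
# Coverage shares reported for the ~195,000-question corpus (percent).
coverage = {
    "tendency to hallucinate": 61.6,
    "lack of performance reliability": 31.2,
    "oversight evasion / self-replication / autonomous AI development": 0.0,
}

# Residual share left for all other regulatory-relevant properties.
residual = round(100.0 - sum(coverage.values()), 1)
```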