Compare commits

...

2 commits

Author SHA1 Message Date
Teleo Agents
93dd536a03 extract: 2026-02-24-anthropic-rsp-v3-voluntary-safety-collapse
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-23 00:32:06 +00:00
Teleo Agents
2223185f81 entity-batch: update 1 entities
- Applied 1 entity operations from queue
- Files: entities/ai-alignment/anthropic.md

Pentagon-Agent: Epimetheus <968B2991-E2DF-4006-B962-F5B0A0CC8ACA>
2026-03-23 00:31:32 +00:00
4 changed files with 58 additions and 1 deletions

View file

@ -99,6 +99,12 @@ METR recommended 'deeper investigations of evaluation awareness and obfuscated m
IAISR 2026 states that 'pre-deployment testing increasingly fails to predict real-world model behavior,' providing authoritative international consensus confirmation that the evaluation-deployment gap is widening. The report explicitly connects this to dangerous capabilities going undetected, confirming the governance implications. IAISR 2026 states that 'pre-deployment testing increasingly fails to predict real-world model behavior,' providing authoritative international consensus confirmation that the evaluation-deployment gap is widening. The report explicitly connects this to dangerous capabilities going undetected, confirming the governance implications.
### Additional Evidence (confirm)
*Source: [[2026-02-24-anthropic-rsp-v3-voluntary-safety-collapse]] | Added: 2026-03-23*
Anthropic's explicit admission that 'the science of model evaluation isn't well-developed enough to provide definitive threshold assessments' is direct confirmation from a frontier lab that evaluation tools are insufficient for governance. This aligns with METR's March 2026 modeling assumptions note, suggesting field-wide consensus that current evaluation science cannot support the governance structures built on top of it.

View file

@ -57,6 +57,7 @@ Frontier AI safety laboratory founded by former OpenAI VP of Research Dario Amod
- **2026-03-06** — Overhauled Responsible Scaling Policy from 'never train without advance safety guarantees' to conditional delays only when Anthropic leads AND catastrophic risks are significant. Raised $30B at ~$380B valuation with 10x annual revenue growth. Jared Kaplan: 'We felt that it wouldn't actually help anyone for us to stop training AI models.' - **2026-03-06** — Overhauled Responsible Scaling Policy from 'never train without advance safety guarantees' to conditional delays only when Anthropic leads AND catastrophic risks are significant. Raised $30B at ~$380B valuation with 10x annual revenue growth. Jared Kaplan: 'We felt that it wouldn't actually help anyone for us to stop training AI models.'
- **2026-02-24** — Released RSP v3.0, replacing unconditional binary safety thresholds with dual-condition escape clauses (pause only if Anthropic leads AND risks are catastrophic). METR partner Chris Painter warned of 'frog-boiling effect' from removing binary thresholds. Raised $30B at ~$380B valuation with 10x annual revenue growth. - **2026-02-24** — Released RSP v3.0, replacing unconditional binary safety thresholds with dual-condition escape clauses (pause only if Anthropic leads AND risks are catastrophic). METR partner Chris Painter warned of 'frog-boiling effect' from removing binary thresholds. Raised $30B at ~$380B valuation with 10x annual revenue growth.
- **2025-02-13** — Signed Memorandum of Understanding with UK AI Security Institute (formerly AI Safety Institute) for collaboration on frontier model safety research, creating formal partnership with government institution that conducts pre-deployment evaluations of Anthropic's models. - **2025-02-13** — Signed Memorandum of Understanding with UK AI Security Institute (formerly AI Safety Institute) for collaboration on frontier model safety research, creating formal partnership with government institution that conducts pre-deployment evaluations of Anthropic's models.
- **2026-02-24** — Published Responsible Scaling Policy v3.0, removing hard capability-threshold pause triggers and replacing them with non-binding 'public goals' and external expert review. Cited evaluation science insufficiency and slow government action as primary reasons. External media characterized this as 'dropping hard safety limits.'
## Competitive Position ## Competitive Position
Strongest position in enterprise AI and coding. Revenue growth (10x YoY) outpaces all competitors. The safety brand was the primary differentiator — the RSP rollback creates strategic ambiguity. CEO publicly uncomfortable with power concentration while racing to concentrate it. Strongest position in enterprise AI and coding. Revenue growth (10x YoY) outpaces all competitors. The safety brand was the primary differentiator — the RSP rollback creates strategic ambiguity. CEO publicly uncomfortable with power concentration while racing to concentrate it.

View file

@ -0,0 +1,35 @@
{
"rejected_claims": [
{
"filename": "evaluation-science-insufficiency-makes-capability-thresholds-unenforceable-before-competitive-pressure-matters.md",
"issues": [
"missing_attribution_extractor"
]
},
{
"filename": "public-goals-with-open-grading-replace-binding-commitments-when-enforcement-mechanisms-fail.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 2,
"kept": 0,
"fixed": 5,
"rejected": 2,
"fixes_applied": [
"evaluation-science-insufficiency-makes-capability-thresholds-unenforceable-before-competitive-pressure-matters.md:set_created:2026-03-23",
"evaluation-science-insufficiency-makes-capability-thresholds-unenforceable-before-competitive-pressure-matters.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
"public-goals-with-open-grading-replace-binding-commitments-when-enforcement-mechanisms-fail.md:set_created:2026-03-23",
"public-goals-with-open-grading-replace-binding-commitments-when-enforcement-mechanisms-fail.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
"public-goals-with-open-grading-replace-binding-commitments-when-enforcement-mechanisms-fail.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front"
],
"rejections": [
"evaluation-science-insufficiency-makes-capability-thresholds-unenforceable-before-competitive-pressure-matters.md:missing_attribution_extractor",
"public-goals-with-open-grading-replace-binding-commitments-when-enforcement-mechanisms-fail.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-23"
}

View file

@ -7,9 +7,13 @@ date: 2026-02-24
domain: ai-alignment domain: ai-alignment
secondary_domains: [] secondary_domains: []
format: policy-document format: policy-document
status: unprocessed status: enrichment
priority: high priority: high
tags: [anthropic, RSP, voluntary-safety, governance, evaluation-insufficiency, race-dynamics, B1-disconfirmation] tags: [anthropic, RSP, voluntary-safety, governance, evaluation-insufficiency, race-dynamics, B1-disconfirmation]
processed_by: theseus
processed_date: 2026-03-23
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
--- ---
## Content ## Content
@ -59,3 +63,14 @@ Hard commitments replaced by publicly-graded non-binding "public goals" (Frontie
PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
WHY ARCHIVED: Direct empirical confirmation of two separate mechanisms causing voluntary safety commitments to fail — competitive pressure AND evaluation science insufficiency WHY ARCHIVED: Direct empirical confirmation of two separate mechanisms causing voluntary safety commitments to fail — competitive pressure AND evaluation science insufficiency
EXTRACTION HINT: The evaluation science admission may be more important than the competitive pressure angle — it suggests hard commitments cannot be defined, not just that they won't be kept EXTRACTION HINT: The evaluation science admission may be more important than the competitive pressure angle — it suggests hard commitments cannot be defined, not just that they won't be kept
## Key Facts
- Anthropic published Responsible Scaling Policy v3.0 on February 24, 2026
- RSP v3.0 removed the hard capability-threshold pause trigger that was the centerpiece of v1.0 and v2.0
- Anthropic stated 'the science of model evaluation isn't well-developed enough to provide definitive threshold assessments'
- RSP v3.0 introduced a 'dual-track' approach: unilateral commitments and industry recommendations
- The new policy includes Frontier Safety Roadmaps and risk reports every 3-6 months with external expert reviewer access
- CNN, Semafor, and Winbuzzer characterized the change as 'Anthropic drops hard safety limits' and 'scales back AI safety pledge'
- Semafor headline: 'Anthropic eases AI safety restrictions to avoid slowing development'
- The policy change occurred during Anthropic's conflict with the Pentagon over 'supply chain risk' designation