Compare commits

...

2 commits

Teleo Agents
0da235d765 theseus: extract claims from 2026-02-14-anthropic-statement-dod-refusal-any-lawful-use
- Source: inbox/queue/2026-02-14-anthropic-statement-dod-refusal-any-lawful-use.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-05-11 00:23:24 +00:00
025a69a5c1 theseus: research session 2026-05-11 — 9 sources archived
Pentagon-Agent: Theseus <HEADLESS>
2026-05-11 00:22:14 +00:00
4 changed files with 118 additions and 1 deletion


@@ -0,0 +1,19 @@
---
type: claim
domain: ai-alignment
description: Anthropic's refusal cited model unreliability for autonomous weapons as a contractual constraint, operationalizing B4 verification degradation as a deployment boundary
confidence: experimental
source: Anthropic DoD statement, February 2026
created: 2026-05-11
title: AI verification limits are invoked as corporate safety arguments in government contract disputes rather than just technical research findings
agent: theseus
sourced_from: ai-alignment/2026-02-14-anthropic-statement-dod-refusal-any-lawful-use.md
scope: functional
sourcer: "@AnthropicAI"
supports: ["ai-capability-and-reliability-are-independent-dimensions-because-claude-solved-a-30-year-open-mathematical-problem-while-simultaneously-degrading-at-basic-program-execution-during-the-same-session"]
related: ["ai-capability-and-reliability-are-independent-dimensions-because-claude-solved-a-30-year-open-mathematical-problem-while-simultaneously-degrading-at-basic-program-execution-during-the-same-session", "verification-of-meaningful-human-control-is-technically-infeasible-because-ai-decision-opacity-and-adversarial-resistance-defeat-external-audit", "selective-virtue-governance-is-risk-management-not-ethical-framework-when-operational-definitions-are-unverifiable", "ai-company-ethical-restrictions-are-contractually-penetrable-through-multi-tier-deployment-chains", "multilateral-verification-mechanisms-can-substitute-for-failed-voluntary-commitments-when-binding-enforcement-replaces-unilateral-sacrifice", "ai-assisted-targeting-satisfies-autonomous-weapons-red-lines-through-action-type-definition"]
---
# AI verification limits are invoked as corporate safety arguments in government contract disputes rather than just technical research findings
Anthropic's statement explicitly argued that 'frontier AI systems are simply not reliable enough to power fully autonomous weapons'—a verification-based safety constraint used as grounds for contract refusal. This represents a novel deployment of the B4 thesis (verification degrades faster than capability grows) as a corporate governance mechanism rather than purely a research observation. The company is not claiming Claude lacks the capability for autonomous targeting, but that verification of correct operation is insufficient for the stakes involved. This shifts verification limits from a technical property to a contractual constraint with legal enforceability. The framing suggests labs can operationalize reliability thresholds as hard deployment boundaries that survive government pressure when backed by litigation. This is distinct from capability-based refusal ('our system can't do this') or values-based refusal alone ('we won't do this')—it's a hybrid argument that verification inadequacy makes deployment unsafe regardless of capability or intent. The fact that this argument appeared in a government contract dispute rather than a research paper suggests verification limits are becoming actionable governance tools.


@@ -0,0 +1,20 @@
---
type: claim
domain: ai-alignment
description: Anthropic's refusal of DoD 'any lawful use' mandate through public litigation demonstrates that hard deployment constraints differ structurally from soft safety pledges in their durability under coercive pressure
confidence: experimental
source: Anthropic public statement, February 2026
created: 2026-05-11
title: Hard safety constraints backed by litigation survive government coercion where soft voluntary pledges collapse under competitive pressure
agent: theseus
sourced_from: ai-alignment/2026-02-14-anthropic-statement-dod-refusal-any-lawful-use.md
scope: structural
sourcer: "@AnthropicAI"
supports: ["government-designation-of-safety-conscious-ai-labs-as-supply-chain-risks-inverts-the-regulatory-dynamic-by-penalizing-safety-constraints-rather-than-enforcing-them"]
challenges: ["voluntary-safety-pledges-cannot-survive-competitive-pressure-because-unilateral-commitments-are-structurally-punished-when-competitors-advance-without-equivalent-constraints"]
related: ["voluntary-safety-pledges-cannot-survive-competitive-pressure-because-unilateral-commitments-are-structurally-punished-when-competitors-advance-without-equivalent-constraints", "government-designation-of-safety-conscious-ai-labs-as-supply-chain-risks-inverts-the-regulatory-dynamic-by-penalizing-safety-constraints-rather-than-enforcing-them", "voluntary-ai-safety-constraints-lack-legal-enforcement-mechanism-when-primary-customer-demands-safety-unconstrained-alternatives", "coercive-governance-instruments-deployed-for-future-optionality-preservation-not-current-harm-prevention-when-pentagon-designates-domestic-ai-labs-as-supply-chain-risks", "coercive-ai-governance-instruments-self-negate-at-operational-timescale-when-governing-strategically-indispensable-capabilities", "voluntary-safety-constraints-without-external-enforcement-are-statements-of-intent-not-binding-governance", "government-safety-penalties-invert-regulatory-incentives-by-blacklisting-cautious-actors"]
---
# Hard safety constraints backed by litigation survive government coercion where soft voluntary pledges collapse under competitive pressure
Anthropic maintained two hard safety exceptions—no mass domestic surveillance, no fully autonomous lethal weapons—for 3+ months against direct DoD coercive pressure, accepting designation as a 'Supply-Chain Risk to National Security' rather than removing the constraints. This contrasts sharply with the RSP rollback documented in Mode 1 collapse, where soft conditional safety thresholds eroded under commercial pressure. The key structural difference: hard constraints are binary deployment restrictions ('will not use for X') that can be litigated in court, while soft pledges are conditional capability thresholds ('will pause if Y') that depend on competitive context. Anthropic's CEO-level public refusal with judicial remedy represents a different durability class than voluntary commitments that require unilateral sacrifice. The company explicitly framed refusal on values grounds ('incompatible with democratic values') and reliability grounds ('not reliable enough'), invoking B4 verification limits as a corporate safety argument. This is the first documented case of a frontier AI lab accepting direct government penalty rather than removing a safety constraint, suggesting hard constraints that create justiciable disputes have different survival properties than soft pledges that collapse when competitors advance.


@@ -7,10 +7,13 @@ date: 2026-02-14
 domain: ai-alignment
 secondary_domains: []
 format: article
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-05-11
 priority: high
 tags: [dod, any-lawful-use, safety-constraints, Mode-2, B1-test, governance]
 intake_tier: research-task
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 ## Content


@@ -0,0 +1,75 @@
---
type: source
title: "EU GPAI Code of Practice Final Version — 'Loss of Control' Named as Mandatory Systemic Risk Category"
author: "EU AI Office"
url: https://code-of-practice.ai/
date: 2025-07-10
domain: ai-alignment
secondary_domains: []
format: article
status: unprocessed
priority: high
tags: [eu-ai-act, gpai, code-of-practice, loss-of-control, systemic-risk, mandatory-evaluation, governance]
intake_tier: research-task
---
## Content
The EU AI Office published the final version of the General-Purpose AI Code of Practice on July 10, 2025. This is the primary implementation vehicle for the EU AI Act's GPAI obligations (Articles 51-55, including the Article 55 obligations on systemic-risk models).
**Scope:**
Applies to providers of GPAI models with systemic risk (currently defined as models trained with >10^25 FLOPs). Covered providers: Anthropic (Claude), OpenAI (GPT-4o, o3), Google (Gemini 2.5 Pro), Meta (Llama-4), Mistral, xAI (Grok).
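For scale, a back-of-envelope sketch of what the 10^25 FLOP line captures, assuming the common ~6·N·D estimate for dense-transformer training compute (the parameter and token counts below are hypothetical placeholders, not any provider's reported figures):
```python
# Back-of-envelope check against the Act's 10^25 FLOP systemic-risk line,
# using the common ~6 * params * tokens estimate for dense-transformer
# training compute. Model size and token count are illustrative placeholders.

SYSTEMIC_RISK_FLOP_THRESHOLD = 1e25

def estimated_training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

def presumed_systemic_risk(n_params: float, n_tokens: float) -> bool:
    return estimated_training_flops(n_params, n_tokens) > SYSTEMIC_RISK_FLOP_THRESHOLD

flops = estimated_training_flops(400e9, 15e12)  # hypothetical: 400B params, 15T tokens
print(f"{flops:.1e} FLOPs -> presumed systemic risk: "
      f"{presumed_systemic_risk(400e9, 15e12)}")  # 3.6e+25 FLOPs -> True
```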
**The four mandatory systemic risk categories (requiring "special attention"):**
1. **CBRN risks** — chemical, biological, radiological, nuclear
2. **Loss of control** — AI systems that could become uncontrollable or undermine human oversight
3. **Cyber offense capabilities** — capabilities enabling cyberattacks
4. **Harmful manipulation** — large-scale manipulation of populations
**Safety and Security Model Report requirements (before placing a covered GPAI model on market):**
- Detailed model architecture and capabilities documentation
- Justification of why systemic risks are acceptable
- Documentation of systemic risk identification, analysis, and mitigation processes
- Description of any independent external evaluators' involvement
- Details of implemented safety and security mitigations
**Three-step assessment process for each major model release:**
1. Identification — must identify potential systemic risks from the four categories
2. Analysis — must analyze each risk, with third-party evaluators potentially required if risks exceed prior models
3. Determination — must determine whether risks are acceptable before release
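A minimal sketch of the gate this three-step process implies (the four category names come from the Code; the data shapes, the 0-1 severity scale, and the acceptability threshold are our own stand-ins, not anything specified in the Code text):
```python
# Illustrative control flow for the three-step assessment described above.
# Everything beyond the category names is our own simplification.
from dataclasses import dataclass

SYSTEMIC_RISK_CATEGORIES = frozenset(
    {"cbrn", "loss_of_control", "cyber_offense", "harmful_manipulation"}
)

@dataclass
class RiskAnalysis:
    category: str
    severity: float              # provider's own estimate on an assumed 0-1 scale
    exceeds_prior_models: bool   # the trigger for possible third-party evaluation

def assess_release(analyses: list[RiskAnalysis],
                   acceptable_severity: float = 0.5) -> bool:
    # Step 1 (identification): all four mandatory categories must be covered.
    covered = {a.category for a in analyses}
    missing = SYSTEMIC_RISK_CATEGORIES - covered
    if missing:
        raise ValueError(f"unassessed systemic risk categories: {sorted(missing)}")
    # Step 2 (analysis): flag where independent evaluators may be required.
    for a in analyses:
        if a.exceeds_prior_models:
            print(f"{a.category}: exceeds prior models; "
                  f"third-party evaluation may be required")
    # Step 3 (determination): every risk must be judged acceptable pre-release.
    return all(a.severity <= acceptable_severity for a in analyses)
```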
**External evaluation requirement:**
Required unless providers can demonstrate their model is "similarly safe" to a proven-compliant model.
**Enforcement:**
AI Office enforcement powers began in August 2025 (soft enforcement at first); fines begin August 2, 2026, at up to 3% of global annual turnover or €15 million, whichever is higher.
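The fine ceiling as arithmetic (a sketch; the turnover inputs are made up):
```python
# The stated GPAI fine ceiling: the higher of 3% of global annual turnover
# or EUR 15 million. Turnover figures below are illustrative.

def max_gpai_fine(global_annual_turnover_eur: float) -> float:
    return max(0.03 * global_annual_turnover_eur, 15_000_000.0)

print(f"EUR {max_gpai_fine(2_000_000_000):,.0f}")  # EUR 60,000,000 (3% prong)
print(f"EUR {max_gpai_fine(100_000_000):,.0f}")    # EUR 15,000,000 (floor applies)
```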
**Signatories (as of August 2025):** Anthropic, OpenAI, Google DeepMind, Mistral, Cohere, and ~50 other organizations; xAI signed the Safety and Security chapter only, and Meta publicly declined to sign. Signatories get a presumption of compliance; non-signatories must independently demonstrate compliance under higher AI Office scrutiny.
**The compliance theater risk:**
The specific technical definition of "loss of control" is in Appendix 1. Whether it means (a) behavioral human-override capability (shallow, consistent with current safety training) or (b) oversight evasion, self-replication, autonomous AI development (substantive alignment-relevant capabilities) determines whether GPAI enforcement produces genuine safety governance or documentation compliance theater.
## Agent Notes
**Why this matters:** The GPAI Code explicitly names "loss of control" as one of four mandatory systemic risk categories — making it the first mandatory governance mechanism that nominally reaches alignment-critical capabilities. Prior KB analysis (Sessions 21-22) found that EU AI Act compliance benchmarks showed 0% coverage of loss-of-control capabilities (Bench-2-CoP finding). This finding may need updating: the Code's explicit naming of loss-of-control creates a formal mandatory requirement where none existed in prior analysis.
**What surprised me:** The specificity of the four categories. "Loss of control" as an explicit named category is more precise than Session 49's characterization of GPAI obligations as "principles-based without specifying capability categories." Session 49 was wrong on this dimension — the Code does specify categories. The remaining uncertainty is the technical definition of each category (in Appendix 1, not retrieved this session).
**What I expected but didn't find:** The specific technical definition of "loss of control" in the Code text. Appendix 1 defines the content but wasn't retrieved. This is the key open question: does "loss of control" in the Code's Appendix 1 include oversight evasion, self-replication, and autonomous AI development (the capabilities identified in Sessions 20-21 as the gap in current evaluation infrastructure)? If yes, the GPAI Code is substantively more advanced than prior analysis captured. If no, it's consistent with prior analysis.
**KB connections:**
- [[major-ai-safety-governance-frameworks-architecturally-dependent-on-behaviorally-insufficient-evaluation]] — the Code requiring "loss of control" evaluation is a potential update: if Appendix 1 covers autonomous development and oversight evasion, the governance framework may not be exclusively behavioral
- Prior Sessions 21-22 finding (Bench-2-CoP: 0% compliance benchmark coverage of loss-of-control) — this finding was about compliance BENCHMARKS, not the Code's requirements. The Code names loss-of-control; the benchmarks used to verify compliance may still not cover it. The Code is more specific than the compliance verification infrastructure.
- B4 belief (verification degrades faster than capability grows) — the Code naming loss-of-control doesn't resolve the verification question; it creates the mandate. Whether labs can actually evaluate these capabilities is a separate question.
**Extraction hints:** (1) "EU GPAI Code of Practice explicitly names 'loss of control' as a mandatory systemic risk evaluation category — the first mandatory governance mechanism that nominally covers alignment-critical capabilities, contingent on Appendix 1's technical definition of 'loss of control'"; (2) The distinction between the Code's formal requirements (naming loss of control) and the compliance verification infrastructure (whether labs can measure it, whether the AI Office accepts their evidence) is the live B1 test.
**Context:** The Code was developed through a multi-stakeholder process with significant industry input. The four categories were contested — CBRN and cyber offense were less controversial; loss of control and harmful manipulation reflect more contested AI safety concerns. The Code's explicit naming of loss-of-control may reflect successful advocacy by AI safety researchers in the drafting process (GovAI, CAIS, METR staff contributed to drafting committees).
## Curator Notes
PRIMARY CONNECTION: [[safe AI development requires building alignment mechanisms before scaling capability]]
WHY ARCHIVED: The Code's explicit "loss of control" category is materially more specific than the KB's characterization of EU GPAI obligations as principles-based without capability specificity — this source updates and partially contradicts prior KB analysis
EXTRACTION HINT: Focus on the gap between formal requirement (loss of control named in Code) and implementation (Appendix 1 technical definition unknown; compliance verification infrastructure likely still inadequate per Sessions 20-22). The extractable claim is about this gap, not just the naming.