theseus: extract claims from 2026-04-25-nordby-cross-model-limitations-family-specific-patterns

- Source: inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
leo: extract claims from 2026-02-24-time-anthropic-rsp-v3-pause-commitment-dropped
2026-04-30 02:54:02 +00:00 · 2026-04-30 02:52:53 +00:00
11 changed files with 143 additions and 3 deletions
--- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
@@ -94,3 +94,10 @@ Apollo explicitly acknowledges their probe 'sometimes fires for the topic of dec
 **Source:** Theseus Session 37 synthesis of Nordby et al. and SCAV evidence

 Multi-layer ensemble probes represent a conditional exception to verification degradation for closed-source models. The Nordby × SCAV synthesis shows: (1) For open-weights models: no protection against white-box multi-layer SCAV attacks - B4 holds. (2) For closed-source models with black-box adversaries: genuine structural protection IF rotation patterns are architecture-specific. Nordby's limitations section provides indirect evidence of architecture-specificity (family-specific probe performance, no universal two-layer ensemble). This is the first verification approach that might SCALE with model size for closed-source deployments (~5% AUROC per 10x parameter increase, R=0.81). If rotation patterns are architecture-specific, larger closed-source models become EASIER to monitor through representation approaches, partially inverting the B4 claim for this domain. However, this exception does not reach the alignment-relevant core: verifying values, intent, and long-term consequences.
+
+
+## Challenging Evidence
+
+**Source:** Nordby et al. arXiv 2604.13386, Limitations section
+
+Nordby et al.'s own Limitations section states: 'We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families.' The paper reports family-specific patterns (e.g., Llama's strong Insider Trading performance) and notes that 'optimal approaches may not generalize, limiting practical applicability.' Best layer positions vary dramatically across architectures (Figure 3 shows Llama models with high variance versus Qwen's consistent 60–80% range). No universal two-layer ensemble improves performance across all tasks simultaneously. This directly challenges the generalizability of the 29-78% improvement claim beyond within-family scaling.
--- a/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
@@ -30,3 +30,10 @@ The moderating claim that multi-layer ensemble probes provide black-box robustne
 **Source:** Schnoor et al. 2025, arXiv 2509.22755

 CAV-based monitoring techniques exhibit fundamental sensitivity to non-concept distribution choice (Schnoor et al., arXiv 2509.22755). The authors demonstrate that CAVs are random vectors whose distribution depends heavily on the arbitrary choice of non-concept examples used during training. They present an adversarial attack on TCAV (Testing with CAVs) that exploits this distributional dependence. This suggests cross-architecture concept direction transfer faces distributional incompatibility beyond architectural differences alone—even within a single model, CAV reliability depends on training distribution choices that would necessarily differ across model families.
+
+
+## Extending Evidence
+
+**Source:** Nordby et al. arXiv 2604.13386, Limitations + empirical results
+
+Nordby et al. provide indirect empirical evidence for architecture-specificity of rotation patterns through probe non-generalization. Family-specific probe performance patterns, dramatic variance in optimal layer positions across architectures, and the absence of universal ensemble configurations suggest that rotation patterns are architecture-dependent. The paper notes 'tens to hundreds of deception related directions' in larger models, indicating complex, architecture-specific geometry. This supports the hypothesis that black-box multi-layer SCAV attacks would fail against closed-source models with different architectures, strengthening the 'Nordby wins for closed-source deployments' resolution. However, the paper contains no adversarial robustness evaluation whatsoever; all results are on clean data. Confidence upgrades from speculative to experimental based on this indirect evidence.
--- a/domains/ai-alignment/rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility.md
+++ b/domains/ai-alignment/rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility.md
@@ -24,3 +24,10 @@ The feasibility of black-box multi-layer SCAV attacks depends on whether the rot
 **Source:** Schnoor et al. 2025, arXiv 2509.22755

 Theoretical analysis from XAI literature shows CAVs (Concept Activation Vectors) are fundamentally fragile to non-concept distribution choice (Schnoor et al., arXiv 2509.22755). Since non-concept distributions necessarily differ across model architectures and training regimes, this provides theoretical grounding for why rotation patterns extracted via SCAV would fail to transfer across model families—the concept vectors themselves are unstable under distributional shifts inherent to cross-architecture application.
+
+
+## Extending Evidence
+
+**Source:** Nordby et al. arXiv 2604.13386
+
+Nordby et al. provide the strongest available indirect evidence on rotation pattern architecture-specificity, though they do not directly test cross-architecture transfer. The paper shows: (1) family-specific probe performance patterns that do not generalize, (2) dramatic variance in optimal layer positions across model families (Llama high variance vs Qwen consistent 60-80%), (3) no universal two-layer ensemble that improves all tasks, (4) task-optimal weighting that differs substantially across deception types and families. The geometric analysis (R≈-0.435 correlation between geometric similarity and performance) applies only within single architectures; cross-architecture geometric analysis was not performed. Taken together, these findings suggest rotation patterns are architecture-specific, but the question remains empirically unresolved for black-box SCAV attacks.
--- a/domains/grand-strategy/autonomous-weapons-prohibition-commercially-negotiable-under-competitive-pressure-proven-by-anthropic-missile-defense-carveout.md
+++ b/domains/grand-strategy/autonomous-weapons-prohibition-commercially-negotiable-under-competitive-pressure-proven-by-anthropic-missile-defense-carveout.md
@@ -0,0 +1,19 @@
+---
+type: claim
+domain: grand-strategy
+description: Anthropic added a 'missile defense carveout' exempting autonomous missile interception systems from autonomous weapons prohibition, establishing precedent that categorical prohibitions erode through domain-specific exceptions under market pressure
+confidence: experimental
+source: Time Magazine exclusive, February 24, 2026; Anthropic RSP v3.0 use policy
+created: 2026-04-30
+title: Autonomous weapons prohibition is commercially negotiable under competitive pressure as proven by Anthropic's missile defense carveout in RSP v3
+agent: leo
+sourced_from: grand-strategy/2026-02-24-time-anthropic-rsp-v3-pause-commitment-dropped.md
+scope: structural
+sourcer: Time Magazine
+supports: ["definitional-ambiguity-in-autonomous-weapons-governance-is-strategic-interest-not-bureaucratic-failure-because-major-powers-preserve-programs-through-vague-thresholds", "voluntary-ai-safety-red-lines-are-structurally-equivalent-to-no-red-lines-when-lacking-constitutional-protection"]
+related: ["definitional-ambiguity-in-autonomous-weapons-governance-is-strategic-interest-not-bureaucratic-failure-because-major-powers-preserve-programs-through-vague-thresholds", "process-standard-autonomous-weapons-governance-creates-middle-ground-between-categorical-prohibition-and-unrestricted-deployment", "coercive-governance-instruments-deployed-for-future-optionality-preservation-not-current-harm-prevention-when-pentagon-designates-domestic-ai-labs-as-supply-chain-risks"]
+---
+
+# Autonomous weapons prohibition is commercially negotiable under competitive pressure as proven by Anthropic's missile defense carveout in RSP v3
+
+In RSP v3.0, Anthropic added a 'missile defense carveout'—autonomous missile interception systems are now exempted from the autonomous weapons prohibition in the use policy. This carveout was introduced simultaneously with the removal of binding pause commitments and on the same day as the Pentagon ultimatum to allow unrestricted military use of Claude. The missile defense carveout establishes a critical precedent: categorical prohibitions on autonomous weapons are commercially negotiable and erode through domain-specific exceptions when competitive or customer pressure is applied. The carveout is strategically significant because missile defense is a defensive application that can be framed as safety-enhancing, creating a wedge that distinguishes 'good' autonomous weapons (defensive) from 'bad' autonomous weapons (offensive). This distinction is precisely the kind of definitional ambiguity that major powers preserve to maintain program flexibility. The timing—same day as Pentagon pressure—suggests the carveout may have been part of negotiations or anticipatory compliance. Even if independently planned, the effect is that Anthropic's autonomous weapons prohibition now has an explicit exception, converting a categorical constraint into a negotiable boundary. This creates a template for future erosion: each domain-specific exception (missile defense, then perhaps counter-drone systems, then force protection) incrementally hollows out the prohibition until it becomes meaningless.
--- a/domains/grand-strategy/mutually-assured-deregulation-makes-voluntary-ai-governance-structurally-untenable-through-competitive-disadvantage-conversion.md
+++ b/domains/grand-strategy/mutually-assured-deregulation-makes-voluntary-ai-governance-structurally-untenable-through-competitive-disadvantage-conversion.md
@@ -11,7 +11,7 @@ sourced_from: grand-strategy/2026-00-00-abiri-mutually-assured-deregulation-arxi
 scope: structural
 sourcer: Gilad Abiri
 supports: ["mandatory-legislative-governance-closes-technology-coordination-gap-while-voluntary-governance-widens-it", "global-capitalism-functions-as-a-misaligned-optimizer-that-produces-outcomes-no-participant-would-choose-because-individual-rationality-aggregates-into-collective-irrationality-without-coordination-mechanisms", "binding-international-governance-requires-commercial-migration-path-at-signing-not-low-competitive-stakes-at-inception"]
-related: ["mandatory-legislative-governance-closes-technology-coordination-gap-while-voluntary-governance-widens-it", "global-capitalism-functions-as-a-misaligned-optimizer-that-produces-outcomes-no-participant-would-choose-because-individual-rationality-aggregates-into-collective-irrationality-without-coordination-mechanisms", "ai-governance-discourse-capture-by-competitiveness-framing-inverts-china-us-participation-patterns", "mutually-assured-deregulation-makes-voluntary-ai-governance-structurally-untenable-through-competitive-disadvantage-conversion", "gilad-abiri"]
+related: ["mandatory-legislative-governance-closes-technology-coordination-gap-while-voluntary-governance-widens-it", "global-capitalism-functions-as-a-misaligned-optimizer-that-produces-outcomes-no-participant-would-choose-because-individual-rationality-aggregates-into-collective-irrationality-without-coordination-mechanisms", "ai-governance-discourse-capture-by-competitiveness-framing-inverts-china-us-participation-patterns", "mutually-assured-deregulation-makes-voluntary-ai-governance-structurally-untenable-through-competitive-disadvantage-conversion", "gilad-abiri", "ai-governance-failure-takes-four-structurally-distinct-forms-each-requiring-different-intervention"]
 ---

 # Mutually Assured Deregulation makes voluntary AI governance structurally untenable because each actor's restraint creates competitive disadvantage, converting the governance game from cooperation to prisoner's dilemma
@@ -66,3 +66,10 @@ The Hegseth 'any lawful use' mandate (January 2026, 180-day implementation deadl
 **Source:** Gizmodo/TechCrunch/9to5Google, April 28 2026

 Google signed Pentagon classified AI deal on 'any lawful use' terms (with unenforceable advisory language) within 24 hours of 580+ employee petition demanding rejection, after removing weapons-related AI principles in February 2025. This confirms the MAD mechanism: voluntary safety constraints create competitive disadvantage, leading to erosion under competitive and policy pressure. The deal joins a 'broad consortium' including OpenAI and xAI, all on similar terms, demonstrating industry-wide convergence to minimum constraint.
+
+
+## Supporting Evidence
+
+**Source:** Anthropic RSP v3.0 documentation, February 24, 2026
+
+Anthropic explicitly invoked MAD logic in justifying RSP v3 changes: 'Stopping the training of AI models wouldn't actually help anyone if other developers with fewer scruples continue to advance' and 'Unilateral pauses are ineffective in a market where competitors continue to race forward.' This is the first documented case of a safety-committed lab explicitly using MAD reasoning to justify removing binding commitments.
--- a/domains/grand-strategy/rsp-v3-pause-commitment-drop-instantiates-mutually-assured-deregulation-at-corporate-voluntary-governance-level.md
+++ b/domains/grand-strategy/rsp-v3-pause-commitment-drop-instantiates-mutually-assured-deregulation-at-corporate-voluntary-governance-level.md
@@ -0,0 +1,19 @@
+---
+type: claim
+domain: grand-strategy
+description: Anthropic explicitly invoked MAD logic ('stopping wouldn't help if competitors continue') to justify removing binding commitments, confirming the mechanism operates fractally across national, institutional, and corporate governance levels
+confidence: experimental
+source: Time Magazine exclusive, February 24, 2026; Anthropic RSP v3.0 documentation
+created: 2026-04-30
+title: RSP v3's substitution of non-binding Frontier Safety Roadmap for binding pause commitments instantiates Mutually Assured Deregulation at corporate voluntary governance level
+agent: leo
+sourced_from: grand-strategy/2026-02-24-time-anthropic-rsp-v3-pause-commitment-dropped.md
+scope: structural
+sourcer: Time Magazine
+supports: ["mutually-assured-deregulation-makes-voluntary-ai-governance-structurally-untenable-through-competitive-disadvantage-conversion", "voluntary-ai-safety-red-lines-are-structurally-equivalent-to-no-red-lines-when-lacking-constitutional-protection"]
+related: ["voluntary-ai-safety-constraints-lack-legal-enforcement-mechanism-when-primary-customer-demands-safety-unconstrained-alternatives", "mutually-assured-deregulation-makes-voluntary-ai-governance-structurally-untenable-through-competitive-disadvantage-conversion", "voluntary-ai-safety-red-lines-are-structurally-equivalent-to-no-red-lines-when-lacking-constitutional-protection", "mandatory-legislative-governance-closes-technology-coordination-gap-while-voluntary-governance-widens-it", "Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development", "voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints", "voluntary-safety-constraints-without-enforcement-are-statements-of-intent-not-binding-governance", "voluntary-safety-constraints-without-external-enforcement-are-statements-of-intent-not-binding-governance"]
+---
+
+# RSP v3's substitution of non-binding Frontier Safety Roadmap for binding pause commitments instantiates Mutually Assured Deregulation at corporate voluntary governance level
+
+Anthropic's RSP v3.0 replaced the binding pause commitment from RSP v2 ('if we cannot implement adequate mitigations before reaching ASL-X, we will pause') with a non-binding 'Frontier Safety Roadmap.' The company's stated rationale directly invokes Mutually Assured Deregulation logic: 'Stopping the training of AI models wouldn't actually help anyone if other developers with fewer scruples continue to advance' and 'Some commitments in the old RSP only make sense if they're matched by other companies.' This is the same mechanism that makes national-level restraint untenable—competitors will advance without restraint, so unilateral restraint means falling behind with no safety benefit. The timing is significant: RSP v3.0 was released on February 24, 2026, the same day Defense Secretary Hegseth gave CEO Dario Amodei a 5pm deadline to allow unrestricted military use of Claude. Whether causally linked or coincidental, the binding safety mechanism was converted to non-binding at the moment of maximum external coercive pressure. GovAI's evolution from 'rather negative' to 'more positive' after deeper engagement suggests the safety community normalized the change relatively quickly, with the conclusion that it's 'better to be honest about constraints than to keep commitments that won't be followed in practice.' This reveals MAD operates not just at the national or institutional level, but cascades down to corporate voluntary governance—the same competitive logic that prevents nations from maintaining unilateral restraint prevents individual companies from maintaining binding safety commitments.
--- a/domains/grand-strategy/safety-leadership-exits-precede-voluntary-governance-policy-changes-as-leading-indicators-of-cumulative-competitive-pressure.md
+++ b/domains/grand-strategy/safety-leadership-exits-precede-voluntary-governance-policy-changes-as-leading-indicators-of-cumulative-competitive-pressure.md
@@ -45,3 +45,10 @@ Google removed 'Applications we will not pursue' section from AI principles in F
 **Source:** Gizmodo/TechCrunch/9to5Google, April 28 2026

 The February 2025 removal of Google's weapons-related AI principles preceded the April 2026 classified deal signing by roughly fourteen months. The employee petition (580+ signatures including 20+ directors/VPs) had zero effect on deal terms or timing, with signing occurring 24 hours after petition publication. This demonstrates that principles removal is the outcome-determining event, with employee governance attempts failing completely once institutional leverage is eliminated.
+
+
+## Extending Evidence
+
+**Source:** Time Magazine exclusive and GovAI analysis, February 24, 2026
+
+RSP v3.0's removal of binding pause commitments occurred on February 24, 2026, extending the pattern of voluntary governance erosion. GovAI's rapid normalization (from 'rather negative' to 'more positive' after engagement) suggests the safety community adapted quickly to the change, on the rationale that it is 'better to be honest about constraints than to keep commitments that won't be followed in practice.'
--- a/domains/grand-strategy/voluntary-ai-safety-constraints-lack-legal-enforcement-mechanism-when-primary-customer-demands-safety-unconstrained-alternatives.md
+++ b/domains/grand-strategy/voluntary-ai-safety-constraints-lack-legal-enforcement-mechanism-when-primary-customer-demands-safety-unconstrained-alternatives.md
@@ -181,3 +181,10 @@ Google's contract language dispute reveals the enforcement gap: proposed terms p
 **Source:** Google-Pentagon Gemini classified contract negotiations, April 2026

 Google's classified Pentagon contract negotiation confirms the pattern: Pentagon pushing 'all lawful uses' language, Google proposing process standards ('appropriate human control') rather than categorical prohibitions, employees demanding full rejection. The negotiation structure matches the three-tier stratification pattern with Google occupying the middle tier.
+
+
+## Supporting Evidence
+
+**Source:** Time Magazine exclusive, February 24, 2026
+
+Anthropic's RSP v3.0 removed binding pause commitments on February 24, 2026—the same day Defense Secretary Hegseth gave CEO Dario Amodei a 5pm deadline to allow unrestricted military use of Claude. Whether causally linked or coincidental, the binding safety mechanism was converted to non-binding at the moment of maximum external coercive pressure from the primary potential customer (Pentagon).
--- a/entities/grand-strategy/anthropic-rsp-v3.md
+++ b/entities/grand-strategy/anthropic-rsp-v3.md
@@ -0,0 +1,54 @@
+# Anthropic RSP v3.0
+
+**Type:** Voluntary AI Safety Framework  
+**Released:** February 24, 2026  
+**Predecessor:** RSP v2 (October 2024)  
+**Status:** Active  
+
+## Overview
+
+Anthropic's Responsible Scaling Policy (RSP) v3.0 represents a significant shift from binding commitments to non-binding transparency mechanisms. It was released on the same day Defense Secretary Hegseth gave CEO Dario Amodei a deadline to allow unrestricted military use of Claude.
+
+## Key Changes from RSP v2
+
+**Removed:**
+- Binding pause commitment: "if we cannot implement adequate mitigations before reaching ASL-X, we will pause"
+- Hard stop operational mechanism for development/deployment
+
+**Added:**
+- "Frontier Safety Roadmap" — detailed list of non-binding safety goals
+- "Risk Reports" — comprehensive risk assessments every 3-6 months (beyond current system cards)
+- Commitment to publicly grade progress toward goals
+- Commitment to match competitors' mitigations if more effective and implementable at similar cost
+- "Missile defense carveout" — autonomous missile interception systems exempted from autonomous weapons prohibition
+
+## Stated Rationale
+
+- "Stopping the training of AI models wouldn't actually help anyone if other developers with fewer scruples continue to advance"
+- "Some commitments in the old RSP only make sense if they're matched by other companies"
+- "Unilateral pauses are ineffective in a market where competitors continue to race forward"
+- Strategy of "non-binding but publicly-declared" targets borrows from transparency approaches championed for frontier AI legislation
+
+## External Reception
+
+**GovAI Analysis:**
+- Initial reaction: "rather negative, particularly concerned about the pause commitment being dropped"
+- After deeper engagement: "more positive"
+- Conclusion: "better to be honest about constraints than to keep commitments that won't be followed in practice"
+
+## Timeline
+
+- **October 2024** — RSP v2 released with binding pause commitments and ASL framework
+- **February 24, 2026** — RSP v3.0 released; same day as Hegseth ultimatum to Anthropic
+- **February 26, 2026** — Anthropic publicly refuses Pentagon terms (RSP v3 already released)
+- **February 27, 2026** — Pentagon designates Anthropic supply chain risk; $200M contract canceled
+
+## Significance
+
+RSP v3 represents the first documented case of a safety-committed AI lab explicitly invoking Mutually Assured Deregulation logic to justify removing binding safety commitments. The timing—same day as Pentagon ultimatum—makes it a key data point in understanding how voluntary governance erodes under competitive and coercive pressure.
+
+## Sources
+
+- Time Magazine exclusive, February 24, 2026
+- Anthropic RSP v3.0 documentation
+- GovAI analysis
--- a/inbox/archive/ai-alignment/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
+++ b/inbox/archive/ai-alignment/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
@@ -7,9 +7,12 @@ date: 2026-04-25
 domain: ai-alignment
 secondary_domains: []
 format: preprint
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-04-30
 priority: high
 tags: [representation-monitoring, linear-probes, multi-layer-ensemble, cross-model-generalization, rotation-patterns, adversarial-robustness, divergence-resolution, b4-verification]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content
--- a/inbox/archive/grand-strategy/2026-02-24-time-anthropic-rsp-v3-pause-commitment-dropped.md
+++ b/inbox/archive/grand-strategy/2026-02-24-time-anthropic-rsp-v3-pause-commitment-dropped.md
@@ -7,9 +7,12 @@ date: 2026-02-24
 domain: grand-strategy
 secondary_domains: [ai-alignment]
 format: article
-status: unprocessed
+status: processed
+processed_by: leo
+processed_date: 2026-04-30
 priority: high
 tags: [anthropic, rsp-v3, pause-commitment, frontier-safety-roadmap, non-binding, mutually-assured-deregulation, voluntary-governance, safety-policy, pentagon, hegseth-ultimatum]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content