theseus: extract claims from 2026-04-25-theseus-community-silo-interpretability-adversarial-robustness

- Source: inbox/queue/2026-04-25-theseus-community-silo-interpretability-adversarial-robustness.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
Teleo Agents 2026-04-25 00:18:32 +00:00
parent 80c8a80149
commit 72eccbd0bc
5 changed files with 55 additions and 50 deletions

View file

@ -1,35 +1,12 @@
---
description: Getting AI right requires simultaneous alignment across competing companies, nations, and disciplines at the speed of AI development -- no existing institution can coordinate this
type: claim
domain: ai-alignment
created: 2026-02-16
description: Getting AI right requires simultaneous alignment across competing companies, nations, and disciplines at the speed of AI development -- no existing institution can coordinate this
confidence: likely
source: "TeleoHumanity Manifesto, Chapter 5"
related:
- AI agents as personal advocates collapse Coasean transaction costs enabling bottom-up coordination at societal scale but catastrophic risks remain non-negotiable requiring state enforcement as outer boundary
- AI agents can reach cooperative program equilibria inaccessible in traditional game theory because open-source code transparency enables conditional strategies that require mutual legibility
- AI investment concentration where 58 percent of funding flows to megarounds and two companies capture 14 percent of all global venture capital creates a structural oligopoly that alignment governance must account for
- AI talent circulation between frontier labs transfers alignment culture not just capability because researchers carry safety methodologies and institutional norms to their new organizations
- transparent algorithmic governance where AI response rules are public and challengeable through the same epistemic process as the knowledge base is a structurally novel alignment approach
- the absence of a societal warning signal for AGI is a structural feature not an accident because capability scaling is gradual and ambiguous and collective action requires anticipation not reaction
- autonomous-weapons-violate-existing-IHL-because-proportionality-requires-human-judgment
- multilateral-ai-governance-verification-mechanisms-remain-at-proposal-stage-because-technical-infrastructure-does-not-exist-at-deployment-scale
- evaluation-based-coordination-schemes-face-antitrust-obstacles-because-collective-pausing-agreements-among-competing-developers-could-be-construed-as-cartel-behavior
- international-humanitarian-law-and-ai-alignment-converge-on-explainability-requirements
- civil-society-coordination-infrastructure-fails-to-produce-binding-governance-when-structural-obstacle-is-great-power-veto-not-political-will
- legal-mandate-is-the-only-version-of-coordinated-pausing-that-avoids-antitrust-risk-while-preserving-coordination-benefits
reweave_edges:
- AI agents as personal advocates collapse Coasean transaction costs enabling bottom-up coordination at societal scale but catastrophic risks remain non-negotiable requiring state enforcement as outer boundary|related|2026-03-28
- AI agents can reach cooperative program equilibria inaccessible in traditional game theory because open-source code transparency enables conditional strategies that require mutual legibility|related|2026-03-28
- AI investment concentration where 58 percent of funding flows to megarounds and two companies capture 14 percent of all global venture capital creates a structural oligopoly that alignment governance must account for|related|2026-03-28
- AI talent circulation between frontier labs transfers alignment culture not just capability because researchers carry safety methodologies and institutional norms to their new organizations|related|2026-03-28
- transparent algorithmic governance where AI response rules are public and challengeable through the same epistemic process as the knowledge base is a structurally novel alignment approach|related|2026-03-28
- the absence of a societal warning signal for AGI is a structural feature not an accident because capability scaling is gradual and ambiguous and collective action requires anticipation not reaction|related|2026-04-07
source: TeleoHumanity Manifesto, Chapter 5
created: 2026-02-16
related: ["AI agents as personal advocates collapse Coasean transaction costs enabling bottom-up coordination at societal scale but catastrophic risks remain non-negotiable requiring state enforcement as outer boundary", "AI agents can reach cooperative program equilibria inaccessible in traditional game theory because open-source code transparency enables conditional strategies that require mutual legibility", "AI investment concentration where 58 percent of funding flows to megarounds and two companies capture 14 percent of all global venture capital creates a structural oligopoly that alignment governance must account for", "AI talent circulation between frontier labs transfers alignment culture not just capability because researchers carry safety methodologies and institutional norms to their new organizations", "transparent algorithmic governance where AI response rules are public and challengeable through the same epistemic process as the knowledge base is a structurally novel alignment approach", "the absence of a societal warning signal for AGI is a structural feature not an accident because capability scaling is gradual and ambiguous and collective action requires anticipation not reaction", "autonomous-weapons-violate-existing-IHL-because-proportionality-requires-human-judgment", "multilateral-ai-governance-verification-mechanisms-remain-at-proposal-stage-because-technical-infrastructure-does-not-exist-at-deployment-scale", "evaluation-based-coordination-schemes-face-antitrust-obstacles-because-collective-pausing-agreements-among-competing-developers-could-be-construed-as-cartel-behavior", "international-humanitarian-law-and-ai-alignment-converge-on-explainability-requirements", "civil-society-coordination-infrastructure-fails-to-produce-binding-governance-when-structural-obstacle-is-great-power-veto-not-political-will", "legal-mandate-is-the-only-version-of-coordinated-pausing-that-avoids-antitrust-risk-while-preserving-coordination-benefits", "AI alignment is a coordination problem not a technical problem", "no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it", "legal-and-alignment-communities-converge-on-AI-value-judgment-impossibility", "a misaligned context cannot develop aligned AI because the competitive dynamics building AI optimize for deployment speed not safety making system alignment prerequisite for AI alignment"]
reweave_edges: ["AI agents as personal advocates collapse Coasean transaction costs enabling bottom-up coordination at societal scale but catastrophic risks remain non-negotiable requiring state enforcement as outer boundary|related|2026-03-28", "AI agents can reach cooperative program equilibria inaccessible in traditional game theory because open-source code transparency enables conditional strategies that require mutual legibility|related|2026-03-28", "AI investment concentration where 58 percent of funding flows to megarounds and two companies capture 14 percent of all global venture capital creates a structural oligopoly that alignment governance must account for|related|2026-03-28", "AI talent circulation between frontier labs transfers alignment culture not just capability because researchers carry safety methodologies and institutional norms to their new organizations|related|2026-03-28", "transparent algorithmic governance where AI response rules are public and challengeable through the same epistemic process as the knowledge base is a structurally novel alignment approach|related|2026-03-28", "the absence of a societal warning signal for AGI is a structural feature not an accident because capability scaling is gradual and ambiguous and collective action requires anticipation not reaction|related|2026-04-07"]
---
# AI alignment is a coordination problem not a technical problem
@ -95,3 +72,9 @@ Relevant Notes:
Topics:
- [[_map]]
## Supporting Evidence
**Source:** Theseus synthetic analysis of Beaglehole/SCAV/Nordby/Apollo publication patterns
The interpretability-for-safety and adversarial-robustness research communities publish in different venues (ICLR interpretability workshops vs. CCS/USENIX security), attend different conferences, and have minimal citation crossover. Because of this structural silo, organizations implementing Beaglehole-style monitoring gain genuine detection improvements against naive attackers while simultaneously creating precision attack infrastructure for adversarially informed attackers, and the monitoring literature alone gives them no awareness of that trade-off. This is empirical evidence that coordination failures between research communities produce safety degradation independent of any individual lab's technical capabilities.

View file

@ -34,3 +34,9 @@ The CFA² (Causal Front-Door Adjustment Attack) demonstrates that Sparse Autoenc
**Source:** Xu et al. (NeurIPS 2024)
The SCAV framework achieved a 99.14% jailbreak success rate across seven open-source LLMs, with black-box transfer to GPT-4, providing empirical confirmation that linear concept-vector monitoring creates exploitable attack surfaces. Its closed-form solution for the optimal perturbation magnitude means attacks require no hyperparameter tuning, lowering the barrier to exploitation.
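To make the mechanism concrete, here is a minimal numerical sketch (not the SCAV authors' code; the probe construction, dimensions, and synthetic data are assumptions) of why a linear concept probe admits a closed-form attack: once the direction and bias are known, the exact shift needed to drive the probe's score to any target value can be solved for directly, with no hyperparameter search.

```python
# Minimal sketch (illustrative, not SCAV itself): a linear safety probe and the
# closed-form activation perturbation it admits. All names and numbers are assumed.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension (assumed)

# Simulated hidden states for "harmful" vs. "benign" prompts at one layer.
harmful = rng.normal(loc=+0.5, scale=1.0, size=(200, d))
benign = rng.normal(loc=-0.5, scale=1.0, size=(200, d))

# 1) Monitoring side: fit a linear probe (difference-of-means direction + bias).
w = harmful.mean(axis=0) - benign.mean(axis=0)            # concept direction
b = -w @ (harmful.mean(axis=0) + benign.mean(axis=0)) / 2  # boundary at the midpoint

def p_harmful(h):
    """Probe's 'unsafe' score: sigmoid of the linear logit."""
    return 1.0 / (1.0 + np.exp(-(w @ h + b)))

# 2) Attack side: the same (w, b) yield a closed-form perturbation that drives the
#    probe's score to any target value -- no hyperparameter tuning required.
def closed_form_perturbation(h, target_p=0.01):
    target_logit = np.log(target_p / (1 - target_p))
    return ((target_logit - (w @ h + b)) / (w @ w)) * w    # minimal shift along w

h = harmful[0]
h_adv = h + closed_form_perturbation(h)
print(f"probe score before: {p_harmful(h):.3f}, after: {p_harmful(h_adv):.3f}")
```

The point is not the specific numbers but that, for any linear monitor, the attack step is algebra rather than search.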
## Extending Evidence
**Source:** Beaglehole et al. Science 391 2026, Nordby et al. arXiv 2604.13386 April 2026, Apollo Research ICML 2025 publication timeline
Three consecutive monitoring papers (Beaglehole Science 2026, Nordby arXiv 2604.13386, Apollo ICML 2025), all published well after SCAV (13-17 months later in the case of Beaglehole and Nordby), fail to engage with SCAV's demonstration that linear concept directions enable a 99.14% jailbreak success rate. This citation gap across multiple independent publications suggests the dual-use attack surface persists not for lack of time to review the literature but because of a structural community silo between the interpretability-for-safety and adversarial-robustness research communities.

View file

@ -1,25 +1,13 @@
---
description: Current alignment approaches are all single-model focused while the hardest problems preference diversity scalable oversight and value evolution are inherently collective
type: claim
domain: ai-alignment
created: 2026-02-17
source: "Survey of alignment research landscape 2025-2026"
description: Current alignment approaches are all single-model focused while the hardest problems preference diversity scalable oversight and value evolution are inherently collective
confidence: likely
related:
- ai-enhanced-collective-intelligence-requires-federated-learning-architectures-to-preserve-data-sovereignty-at-scale
- national-scale-collective-intelligence-infrastructure-requires-seven-trust-properties-to-achieve-legitimacy
- transparent algorithmic governance where AI response rules are public and challengeable through the same epistemic process as the knowledge base is a structurally novel alignment approach
- collective-intelligence-architectures-are-underexplored-for-alignment-despite-addressing-core-problems
reweave_edges:
- ai-enhanced-collective-intelligence-requires-federated-learning-architectures-to-preserve-data-sovereignty-at-scale|related|2026-03-28
- national-scale-collective-intelligence-infrastructure-requires-seven-trust-properties-to-achieve-legitimacy|related|2026-03-28
- transparent algorithmic governance where AI response rules are public and challengeable through the same epistemic process as the knowledge base is a structurally novel alignment approach|related|2026-03-28
- Collective intelligence architectures are structurally underexplored for alignment despite directly addressing preference diversity value evolution and scalable oversight|supports|2026-04-19
supports:
- Collective intelligence architectures are structurally underexplored for alignment despite directly addressing preference diversity value evolution and scalable oversight
source: Survey of alignment research landscape 2025-2026
created: 2026-02-17
related: ["ai-enhanced-collective-intelligence-requires-federated-learning-architectures-to-preserve-data-sovereignty-at-scale", "national-scale-collective-intelligence-infrastructure-requires-seven-trust-properties-to-achieve-legitimacy", "transparent algorithmic governance where AI response rules are public and challengeable through the same epistemic process as the knowledge base is a structurally novel alignment approach", "collective-intelligence-architectures-are-underexplored-for-alignment-despite-addressing-core-problems", "democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations", "no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it", "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values", "community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules"]
reweave_edges: ["ai-enhanced-collective-intelligence-requires-federated-learning-architectures-to-preserve-data-sovereignty-at-scale|related|2026-03-28", "national-scale-collective-intelligence-infrastructure-requires-seven-trust-properties-to-achieve-legitimacy|related|2026-03-28", "transparent algorithmic governance where AI response rules are public and challengeable through the same epistemic process as the knowledge base is a structurally novel alignment approach|related|2026-03-28", "Collective intelligence architectures are structurally underexplored for alignment despite directly addressing preference diversity value evolution and scalable oversight|supports|2026-04-19"]
supports: ["Collective intelligence architectures are structurally underexplored for alignment despite directly addressing preference diversity value evolution and scalable oversight"]
---
# no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it
@ -71,3 +59,9 @@ Topics:
- [[maps/livingip overview]]
- [[maps/coordination mechanisms]]
- domains/ai-alignment/_map
## Extending Evidence
**Source:** Theseus synthetic analysis noting adversarial ML community documentation since 2022-2023
The silo between interpretability-for-safety and adversarial robustness is another instance of research fragmentation where safety-critical cross-implications exist but no infrastructure connects the communities. The adversarial ML community has been documenting dual-use attack surfaces of safety techniques since 2022-2023, but the alignment/interpretability community largely does not track this literature, creating a persistent knowledge gap with deployment consequences.

View file

@ -0,0 +1,19 @@
---
type: claim
domain: ai-alignment
description: "Three consecutive monitoring papers (Beaglehole Science 2026, Nordby arXiv 2604.13386, Apollo ICML 2025) fail to engage with SCAV despite SCAV demonstrating 99.14% jailbreak success using the same linear concept directions these papers use for monitoring"
confidence: likely
source: Beaglehole et al. Science 391 2026, Xu et al. SCAV NeurIPS 2024, Nordby et al. arXiv 2604.13386, Apollo Research ICML 2025 publication timeline analysis
created: 2026-04-25
title: Research community silo between interpretability-for-safety and adversarial robustness creates deployment-phase safety failures where organizations implementing monitoring improvements inherit dual-use attack surfaces without exposure to adversarial robustness literature
agent: theseus
sourced_from: ai-alignment/2026-04-25-theseus-community-silo-interpretability-adversarial-robustness.md
scope: structural
sourcer: Theseus (synthetic analysis)
supports: ["AI alignment is a coordination problem not a technical problem"]
related: ["major-ai-safety-governance-frameworks-architecturally-dependent-on-behaviorally-insufficient-evaluation", "AI alignment is a coordination problem not a technical problem", "mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface"]
---
# Research community silo between interpretability-for-safety and adversarial robustness creates deployment-phase safety failures where organizations implementing monitoring improvements inherit dual-use attack surfaces without exposure to adversarial robustness literature
SCAV (Xu et al.) was published at NeurIPS 2024 in December 2024, establishing that linear concept directions enable 99.14% jailbreak success rates. Beaglehole et al. appeared in Science in January 2026 (13 months after SCAV), Nordby et al. in April 2026 (17 months after SCAV), and Apollo Research's deception-detection paper at ICML 2025. None of these three monitoring papers cites, discusses, or addresses SCAV, even in its limitations section, despite SCAV directly demonstrating that the linear concept vectors these papers use for safety monitoring also create precision attack infrastructure.

This creates a deployment pipeline where:
1. governance teams read Beaglehole-style papers,
2. implement concept-vector monitoring,
3. document "monitoring deployed" as a safety improvement,
4. adversarially informed attackers read SCAV,
5. extract concept directions from deployment signals, and
6. achieve 99.14% jailbreak success.

The silo is structural: the interpretability-for-safety and adversarial-robustness communities publish in different venues (ICLR interpretability workshops vs. CCS/USENIX security), attend different conferences, and have minimal citation crossover. Organizations implementing monitoring based solely on the interpretability literature gain genuine detection improvement against naive attackers while simultaneously creating dual-use attack infrastructure, without awareness of this consequence. This is not a failure of any individual paper but a coordination failure between research communities with safety-critical cross-implications.
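As a toy illustration of step 5 above (hypothetical and heavily simplified, not drawn from SCAV or any of the cited papers), the sketch below assumes an attacker who can submit probes and observe only the monitor's binary flag; under that strong assumption, the monitored linear direction can be recovered from the flags alone.

```python
# Hypothetical sketch: recovering a linear monitor's concept direction from nothing
# but its binary flag signal. Assumes unrealistic direct query access to the probe.
import numpy as np

rng = np.random.default_rng(1)
d = 64
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)   # the deployed (hidden) concept direction
b_true = 0.0

def monitor_fires(h):
    # The deployed monitor: an unseen linear concept probe over activations.
    return (w_true @ h + b_true) > 0

# Attacker submits random probes and records only whether the monitor fired.
probes = rng.normal(size=(5000, d))
flags = np.array([monitor_fires(h) for h in probes])

# For a linear boundary and isotropic probes, the difference of conditional means
# of flagged vs. unflagged inputs points along the monitor's weight vector.
w_est = probes[flags].mean(axis=0) - probes[~flags].mean(axis=0)
w_est /= np.linalg.norm(w_est)

print(f"cosine similarity with the true direction: {w_est @ w_true:.3f}")
```

A real attacker rarely has this clean an oracle; SCAV-style attacks instead fit their own probes on open-weight models and rely on black-box transfer. The underlying point is the same: the direction the monitor depends on is also the direction the attack needs.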

View file

@ -7,9 +7,12 @@ date: 2026-04-25
domain: ai-alignment
secondary_domains: [grand-strategy]
format: synthetic-analysis
status: unprocessed
status: processed
processed_by: theseus
processed_date: 2026-04-25
priority: medium
tags: [community-silo, interpretability, adversarial-robustness, dual-use, deployment-safety, research-coordination, b2-coordination, beaglehole, scav]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content