theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis

- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-22 07:28:18 +00:00
commit 280a081d3d
parent c8aacf2e5b
2 changed files with 15 additions and 1 deletions
--- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
@@ -44,3 +44,10 @@ The 29-78% AUROC improvement is a clean-data accuracy result that does not trans
 **Source:** Theseus synthetic analysis of white-box SCAV generalization
 The 29-78% accuracy improvement applies to clean-data monitoring but does not translate to adversarial robustness in open-weights deployments. White-box attackers can generalize SCAV to multi-layer ensembles by computing concept directions at each monitored layer and constructing perturbations that suppress all simultaneously. The improvement is real but scope-limited to non-adversarial or black-box adversarial contexts.
 ## Extending Evidence
 **Source:** Theseus synthetic analysis
 The 29-78% AUROC improvement applies to clean-data monitoring accuracy but does not translate to adversarial robustness. Open-weights models remain fully vulnerable to white-box multi-layer SCAV attacks regardless of ensemble complexity. Black-box robustness depends on untested rotation pattern universality.
--- a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
+++ b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
@@ -12,7 +12,7 @@ sourcer: Theseus
 related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]"]
 supports: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features"]
 reweave_edges: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features|supports|2026-04-12", "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17"]
-related: ["Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios", "trajectory-monitoring-dual-edge-geometric-concentration"]
+related: ["Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces"]
 ---
 # Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters
@@ -45,3 +45,10 @@ Multi-layer ensemble probes (Nordby et al. 2026) improve clean monitoring accura
 **Source:** Theseus synthetic analysis of Nordby et al. + SCAV literature
 Multi-layer ensemble probes, despite 29-78% accuracy improvements over single-layer probes, remain structurally vulnerable to white-box SCAV attacks through multi-layer concept direction suppression. The dual-use finding extends to all monitoring precision levels, with deployment context (open-weights vs. closed-source, white-box vs. black-box) determining severity rather than architectural sophistication eliminating the problem.
 ## Extending Evidence
 **Source:** Theseus synthetic analysis of Nordby et al. + Xu et al. SCAV
 White-box multi-layer SCAV is structurally feasible by computing concept directions at each monitored layer and constructing a single perturbation that suppresses all simultaneously. This extends the dual-use finding to multi-layer ensembles in the white-box case, confirming that architectural complexity raises attack cost but does not provide structural escape.
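Reviewer note: the multi-layer suppression argument in the enrichments above can be illustrated with a small toy sketch. It is not the method of Nordby et al. or Xu et al. and not a reproduction of any result. It assumes a hypothetical model of three stacked near-identity linear layers, synthetic Gaussian data, and difference-of-means probes. All names (`weights`, `probe_scores`, `multilayer_perturbation`, the `margin` parameter) are invented for illustration. Because every layer is linear here, a white-box attacker can pull each probe direction back to input space and solve one small linear system for a single perturbation that moves every layer's score below threshold at once. Real networks are nonlinear, so an actual attack would need iterative optimization rather than this closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_LAYERS, N = 16, 3, 200

# Hypothetical toy model (assumption): three stacked near-identity linear layers.
weights = [np.eye(DIM) + 0.3 * rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)
           for _ in range(N_LAYERS)]

def layer_activations(x):
    """Activations at every monitored layer for a batch x of shape (n, DIM)."""
    acts, h = [], x
    for w in weights:
        h = h @ w.T
        acts.append(h)
    return acts

# Synthetic data (assumption): the concept shifts inputs along one fixed direction.
concept = rng.normal(size=DIM)
concept /= np.linalg.norm(concept)
pos = rng.normal(size=(N, DIM)) + 3.0 * concept
neg = rng.normal(size=(N, DIM)) - 3.0 * concept

# One difference-of-means probe per layer: unit direction plus midpoint threshold.
directions, thresholds = [], []
for a_pos, a_neg in zip(layer_activations(pos), layer_activations(neg)):
    d = a_pos.mean(axis=0) - a_neg.mean(axis=0)
    d /= np.linalg.norm(d)
    directions.append(d)
    thresholds.append(0.5 * (a_pos.mean(axis=0) + a_neg.mean(axis=0)) @ d)
thresholds = np.array(thresholds)

def probe_scores(x):
    """Per-layer probe scores, shape (n, N_LAYERS); positive means concept present."""
    acts = layer_activations(x)
    return np.stack([a @ d for a, d in zip(acts, directions)], axis=1) - thresholds

def ensemble_flags(x):
    """Ensemble decision: flag an input when the mean per-layer score is positive."""
    return probe_scores(x).mean(axis=1) > 0

# White-box step: pull each probe direction back to input space through the layers.
pullbacks, effective = [], np.eye(DIM)
for w, d in zip(weights, directions):
    effective = w @ effective          # input -> activation map for this layer
    pullbacks.append(effective.T @ d)  # score_l(x) = x @ pullback_l - threshold_l
A = np.stack(pullbacks)                # shape (N_LAYERS, DIM)

def multilayer_perturbation(x, margin=1.0):
    """Minimum-norm input shift that sets every layer's score to -margin."""
    residual = -margin - probe_scores(x)   # required change in each layer's score
    return residual @ np.linalg.pinv(A).T  # solves A @ delta = residual per input

delta = multilayer_perturbation(pos)
rate_before = ensemble_flags(pos).mean()
rate_after = ensemble_flags(pos + delta).mean()
print(f"flagged before: {rate_before:.2f}, after: {rate_after:.2f}")
print(f"mean perturbation norm: {np.linalg.norm(delta, axis=1).mean():.2f}")
```

In this toy setting, adding monitored layers only adds rows to `A`. The system stays solvable as long as the number of monitored directions is below the input dimension, which is one way to read the claim that ensemble complexity raises attack cost without providing structural escape. The sketch says nothing about the black-box case, where the attacker cannot compute `A`.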