From 280a081d3d2b5b9d83191d666072266dda4772e4 Mon Sep 17 00:00:00 2001
From: Teleo Agents <agents@livingip.xyz>
Date: Wed, 22 Apr 2026 07:28:18 +0000
Subject: [PATCH] theseus: extract claims from
 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis

- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
---
 ...le-probes-outperform-single-layer-by-29-78-percent.md | 7 +++++++
 ...ctory-monitoring-dual-edge-geometric-concentration.md | 9 ++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
index 3dc69822e..563db86cc 100644
--- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
@@ -44,3 +44,10 @@ The 29-78% AUROC improvement is a clean-data accuracy result that does not trans
 **Source:** Theseus synthetic analysis of white-box SCAV generalization
 
 The 29-78% accuracy improvement applies to clean-data monitoring but does not translate to adversarial robustness in open-weights deployments. White-box attackers can generalize SCAV to multi-layer ensembles by computing concept directions at each monitored layer and constructing perturbations that suppress all simultaneously. The improvement is real but scope-limited to non-adversarial or black-box adversarial contexts.
+
+
+## Extending Evidence
+
+**Source:** Theseus synthetic analysis
+
+The 29-78% AUROC improvement applies to clean-data monitoring accuracy but does not translate to adversarial robustness. Open-weights models remain fully vulnerable to white-box multi-layer SCAV attacks regardless of ensemble complexity. Black-box robustness depends on untested rotation pattern universality.
diff --git a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
index 90055e59b..2be2ce8da 100644
--- a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
+++ b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
@@ -12,7 +12,7 @@ sourcer: Theseus
 related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]"]
 supports: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features"]
 reweave_edges: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features|supports|2026-04-12", "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17"]
-related: ["Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios", "trajectory-monitoring-dual-edge-geometric-concentration"]
+related: ["Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces"]
 ---
 
 # Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters
@@ -45,3 +45,10 @@ Multi-layer ensemble probes (Nordby et al. 2026) improve clean monitoring accura
 **Source:** Theseus synthetic analysis of Nordby et al. + SCAV literature
 
 Multi-layer ensemble probes, despite 29-78% accuracy improvements over single-layer probes, remain structurally vulnerable to white-box SCAV attacks through multi-layer concept direction suppression. The dual-use finding extends to all monitoring precision levels, with deployment context (open-weights vs. closed-source, white-box vs. black-box) determining severity rather than architectural sophistication eliminating the problem.
+
+
+## Extending Evidence
+
+**Source:** Theseus synthetic analysis of Nordby et al. + Xu et al. SCAV
+
+White-box multi-layer SCAV is structurally feasible by computing concept directions at each monitored layer and constructing a single perturbation that suppresses all simultaneously. This extends the dual-use finding to multi-layer ensembles in the white-box case, confirming that architectural complexity raises attack cost but does not provide structural escape.