From a832cb99c004d159b9176c8e34e2119e11476c59 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Wed, 22 Apr 2026 05:28:47 +0000
Subject: [PATCH] theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis

- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus
---
 ...mble-probes-outperform-single-layer-by-29-78-percent.md | 7 +++++++
 ...ignment-without-creating-adversarial-attack-surfaces.md | 7 +++++++
 ...jectory-monitoring-dual-edge-geometric-concentration.md | 7 +++++++
 3 files changed, 21 insertions(+)

diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
index caa5bbd89..3dc69822e 100644
--- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
@@ -37,3 +37,10 @@ Nordby et al.'s clean-data accuracy improvements do not translate to adversarial
 **Source:** Theseus adversarial robustness analysis
 
 The 29-78% AUROC improvement is a clean-data accuracy result that does not translate to adversarial robustness. Nordby et al. contains no adversarial evaluation. White-box multi-layer SCAV is structurally feasible against these ensembles in open-weights models. The accuracy improvement is real but scoped to non-adversarial monitoring contexts.
+
+
+## Extending Evidence
+
+**Source:** Theseus synthetic analysis of white-box SCAV generalization
+
+The 29-78% accuracy improvement applies to clean-data monitoring but does not translate to adversarial robustness in open-weights deployments. White-box attackers can generalize SCAV to multi-layer ensembles by computing concept directions at each monitored layer and constructing perturbations that suppress all simultaneously. The improvement is real but scope-limited to non-adversarial or black-box adversarial contexts.
diff --git a/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md b/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md
index 9fc0dc718..b0472c2cd 100644
--- a/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md
+++ b/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md
@@ -24,3 +24,10 @@ The method characterizes alignment-relevant properties as geometric patterns in
 **Source:** Theseus synthetic analysis (2026-04-22)
 
 Trajectory geometry monitoring does create adversarial attack surfaces through multi-layer SCAV generalization. The claim that geometric approaches avoid attack surface creation is incorrect for white-box adversaries with access to model weights. May need qualification to 'black-box adversaries only, contingent on rotation pattern specificity.'
+
+
+## Challenging Evidence
+
+**Source:** Theseus synthetic analysis of SCAV generalization to multi-layer ensembles
+
+Multi-layer ensemble analysis shows trajectory geometry monitoring DOES create attack surfaces in white-box settings. While multi-layer ensembles are harder to exploit than single-layer probes, white-box multi-layer SCAV is structurally feasible through simultaneous suppression of concept directions at all monitored layers. The claim that trajectory geometry avoids attack surfaces may need qualification to 'reduces attack surface in black-box settings if rotation patterns are model-specific.'
diff --git a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
index bc3c8682a..90055e59b 100644
--- a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
+++ b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
@@ -38,3 +38,10 @@ The dual-use vulnerability extends to multi-layer ensemble monitoring, not just
 **Source:** Theseus synthetic analysis of Nordby et al. (arXiv 2604.13386, April 2026)
 
 Multi-layer ensemble probes (Nordby et al. 2026) improve clean monitoring accuracy 29-78% but provide no structural protection against white-box adversaries in open-weights models. White-box multi-layer SCAV can compute concept directions at each monitored layer and construct a single perturbation suppressing all simultaneously. The dual-use finding extends to all monitoring precision levels with scope qualification: open-weights models face structural vulnerability regardless of ensemble complexity; closed-source models may gain genuine black-box protection if rotation patterns are model-specific (untested).
+
+
+## Extending Evidence
+
+**Source:** Theseus synthetic analysis of Nordby et al. + SCAV literature
+
+Multi-layer ensemble probes, despite 29-78% accuracy improvements over single-layer probes, remain structurally vulnerable to white-box SCAV attacks through multi-layer concept direction suppression. The dual-use finding extends to all monitoring precision levels, with deployment context (open-weights vs. closed-source, white-box vs. black-box) determining severity rather than architectural sophistication eliminating the problem.