From f735da7ba3550f31ac41b8bd0d21f7f71fbc6682 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Wed, 22 Apr 2026 02:05:49 +0000
Subject: [PATCH] theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis

- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus
---
 ...mble-probes-outperform-single-layer-by-29-78-percent.md | 7 +++++++
 ...near-concept-vectors-creates-dual-use-attack-surface.md | 7 +++++++
 ...jectory-monitoring-dual-edge-geometric-concentration.md | 7 +++++++
 3 files changed, 21 insertions(+)

diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
index a027733f5..f81d27d14 100644
--- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
@@ -23,3 +23,10 @@ Nordby et al. found that multi-layer ensemble probes improved AUROC by 29 percen
 **Source:** Theseus synthesis of Nordby et al. (arXiv 2604.13386, April 2026)
 
 The 29-78% AUROC improvement occurs because deception-relevant concept directions rotate across layers rather than remaining fixed in the residual stream. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure. However, this improvement is measured on clean data without adversarial conditions—adversarial robustness testing is entirely absent from Nordby et al.'s scope.
+
+
+## Extending Evidence
+
+**Source:** Theseus synthesis of Nordby et al. (arXiv 2604.13386)
+
+The performance improvement comes from capturing the rotation of deception-relevant concept directions across layers rather than a fixed direction. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure. This explains the mechanism behind the AUROC improvement.
diff --git a/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md b/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md
index b58b629a1..96f6c1151 100644
--- a/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md
+++ b/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md
@@ -23,3 +23,10 @@ Xu et al. introduce SCAV (Steering Concept Activation Vectors), which identifies
 **Source:** Theseus synthetic analysis combining Nordby et al. and Xu et al. SCAV
 
 Multi-layer ensemble probes do not escape the dual-use attack surface identified for single-layer probes. With white-box access, SCAV can be generalized to compute concept directions at each monitored layer and construct a single perturbation suppressing all simultaneously. This is a higher-dimensional optimization requiring more computation and data, but is structurally feasible by the same mechanism. Open-weights models (Llama, Mistral, Falcon) remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity.
+
+
+## Extending Evidence
+
+**Source:** Theseus synthesis of Nordby × SCAV (2026-04-22)
+
+Multi-layer ensemble architectures do not escape the dual-use structure — they shift the attack surface to higher-dimensional optimization. White-box multi-layer SCAV is structurally feasible by generalizing the single-layer attack to suppress concept directions at multiple layers simultaneously. No architectural escape exists for white-box adversaries. Black-box robustness depends on whether rotation patterns (how concept directions change across layers) are universal or model-specific, which is currently untested.
diff --git a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
index eeddcea40..fdc0d39ba 100644
--- a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
+++ b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
@@ -24,3 +24,10 @@ Weight-space alignment geometry research (2602.15799) establishes that alignment
 **Source:** Theseus synthesis addressing Sessions 29-30 open question
 
 The dual-use finding now extends to multi-layer ensemble monitoring with deployment-context qualification: open-weights models face white-box multi-layer SCAV attacks that architectural improvements cannot prevent, while closed-source models may gain genuine black-box protection if concept direction rotation patterns are model-specific rather than universal. The monitoring precision hierarchy holds across all levels, but the severity of dual-use risk depends on whether attackers have white-box or black-box access.
+
+
+## Extending Evidence
+
+**Source:** Theseus synthetic analysis (2026-04-22)
+
+The dual-use finding extends to multi-layer ensemble probes with important scope qualification: open-weights models remain fully vulnerable to white-box multi-layer SCAV attacks (attacker computes concept directions at all monitored layers and constructs simultaneous suppression). Closed-source models may gain genuine black-box protection if rotation patterns are model-specific, but this remains empirically untested. The monitoring precision hierarchy holds across all levels, but attack cost varies by deployment context.
-- 
2.45.2