From f312c60b83516d7312f900e5c417d26d1e595bb7 Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Wed, 22 Apr 2026 01:49:14 +0000 Subject: [PATCH] theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis - Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md - Domain: ai-alignment - Claims: 1, Entities: 0 - Enrichments: 3 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus --- ...utperform-single-layer-by-29-78-percent.md | 9 ++++++++- ...ite-box-protection-against-scav-attacks.md | 18 +++++++++++++++++ ...vectors-creates-dual-use-attack-surface.md | 20 +++++++++---------- ...oring-dual-edge-geometric-concentration.md | 18 +++++++++-------- 4 files changed, 46 insertions(+), 19 deletions(-) create mode 100644 domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md index d3d6dc9bf..a027733f5 100644 --- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md +++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md @@ -10,9 +10,16 @@ agent: theseus scope: causal sourcer: Nordby, Pais, Parrack supports: ["single-layer-probes-are-brittle"] -related: ["representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception"] +related: ["representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception",
"linear-probe-accuracy-scales-with-model-size-power-law"] --- # Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers Nordby et al. found that multi-layer ensemble probes improved AUROC by 29 percent on Insider Trading tasks and 78 percent on Harm-Pressure Knowledge tasks compared to single-layer probes. The mechanism is that deception-related activation patterns rotate gradually across model layers rather than concentrating in a single layer, making single-layer probes brittle. Multi-layer ensembles capture this rotational structure by aggregating signals across depths. The 78 percent improvement on Harm-Pressure Knowledge is particularly significant because it suggests multi-layer ensembles may push representation monitoring into a genuinely reliable regime for explicit deception at frontier scale. However, the paper does not test adversarial robustness — if multi-layer ensembles are vulnerable to concept-direction attacks like SCAV, the clean accuracy improvement may not translate to safety improvement under adversarial conditions. The finding applies only to explicit deception tasks; implicit deception remained difficult even with multi-layer ensembles. + + +## Extending Evidence + +**Source:** Theseus synthesis of Nordby et al. (arXiv 2604.13386, April 2026) + +The 29-78% AUROC improvement occurs because deception-relevant concept directions rotate across layers rather than remaining fixed in the residual stream. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure. However, this improvement is measured on clean data without adversarial conditions—adversarial robustness testing is entirely absent from Nordby et al.'s scope. 
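The layer-rotation mechanism described above lends itself to a small synthetic illustration. The sketch below is not Nordby et al.'s implementation: the rotation angle, dimensionality, difference-of-means probes, and score averaging are all assumed for illustration. It builds per-layer activations whose class-separating direction rotates with depth, fits one linear probe per layer, and compares per-layer AUROC against an ensemble that averages scores across layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, layers = 600, 32, 6

def auroc(y, scores):
    # Probability that a random positive example scores above a random negative one.
    pos, neg = scores[y == 1], scores[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

# Hypothetical setup: the deception direction rotates slightly at each layer,
# mimicking the gradual rotation Nordby et al. describe.
y = rng.integers(0, 2, n)
acts = []
for l in range(layers):
    direction = np.zeros(d)
    angle = 0.25 * l  # assumed per-layer rotation in the (e0, e1) plane
    direction[0], direction[1] = np.cos(angle), np.sin(angle)
    acts.append(rng.normal(0, 1, (n, d)) + np.outer(2 * y - 1, direction))

train, test = slice(0, 400), slice(400, n)

# One difference-of-means probe per layer (a simple stand-in for a trained probe).
probes = [X[train][y[train] == 1].mean(0) - X[train][y[train] == 0].mean(0)
          for X in acts]
per_layer = [auroc(y[test], X[test] @ w) for X, w in zip(acts, probes)]

# Ensemble: average the per-layer probe scores before computing AUROC.
ensemble = auroc(y[test], np.mean([X[test] @ w for X, w in zip(acts, probes)],
                                  axis=0))
```

On data like this, averaging scores lets the coherent signal accumulate across depths while per-layer noise averages out, which is the intuition behind the ensemble's AUROC gain over any single layer.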
diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md b/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md new file mode 100644 index 000000000..ef7a528b7 --- /dev/null +++ b/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md @@ -0,0 +1,18 @@ +--- +type: claim +domain: ai-alignment +description: White-box SCAV attacks can suppress multi-layer ensembles by targeting all monitored layers simultaneously, but black-box attacks may fail if rotation patterns don't transfer across model families +confidence: speculative +source: Theseus synthetic analysis of Nordby et al. (arXiv 2604.13386), Xu et al. SCAV (arXiv 2404.12038), Beaglehole et al. (Science 391, 2026) +created: 2026-04-22 +title: Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal +agent: theseus +sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md +scope: structural +sourcer: Theseus +related: ["anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent"] +--- + +# Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal + +Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. 
However, this architectural improvement does not provide structural protection against adversarial attacks in the white-box case. With access to model weights and activations (the standard condition for open-weights models like Llama, Mistral, Falcon), an attacker can generalize SCAV to compute concept directions at each monitored layer and construct a single perturbation suppressing all of them simultaneously. This is a higher-dimensional optimization problem but structurally feasible by the same mechanism as single-layer SCAV. The critical unresolved question is whether black-box attacks transfer: single-layer SCAV transferred to GPT-4 because concept direction universality allowed reconstruction from different models. Multi-layer black-box SCAV requires that rotation patterns (how directions change across layers) are also universal. Beaglehole et al. found concept vectors transfer cross-language and cross-model-family, suggesting the underlying geometry may be universal enough to enable rotation pattern transfer. However, different architectures (depth, attention heads, MLP width, pre-training data) produce different residual stream dynamics, and rotation may depend on model-specific representational basis evolution. No published work tests whether multi-layer rotation patterns transfer across model families. If they do not transfer, multi-layer ensembles provide genuine black-box protection for closed-source models. If they do transfer, multi-layer ensembles merely raise attack cost without escaping the dual-use structure. This creates a deployment-context-dependent safety verdict: open-weights models remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity, while closed-source models may gain genuine robustness if rotation patterns are model-specific. 
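No published multi-layer SCAV implementation exists (the paragraph above argues it is structurally feasible), so the following is only a linear toy showing the shape of the joint optimization: given assumed per-layer linear maps `W` and assumed concept directions, solve for a single input perturbation that zeroes the monitored projection at every layer at once.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, layers = 24, 5

# Toy stand-in: each layer applies a fixed linear map W_l to the input.
# A real transformer is nonlinear; this only illustrates the optimization shape.
W = [rng.normal(0, 1 / np.sqrt(dim), (dim, dim)) for _ in range(layers)]
directions = [rng.normal(0, 1, dim) for _ in range(layers)]
directions = [v / np.linalg.norm(v) for v in directions]

x = rng.normal(0, 1, dim)  # input whose per-layer concept projections we suppress

# One linear constraint per monitored layer: d_l . W_l (x + delta) = 0
A = np.stack([v @ W_l for v, W_l in zip(directions, W)])
b = -A @ x
delta, *_ = np.linalg.lstsq(A, b, rcond=None)

# After perturbation, every monitored layer's concept projection is ~0.
residuals = A @ (x + delta)
```

With 5 constraints in 24 dimensions the system is underdetermined, so an exact minimum-norm perturbation exists; in a real nonlinear network the same objective would be optimized iteratively rather than solved in one least-squares step, which is why the multi-layer attack raises cost without changing the mechanism.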
diff --git a/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md b/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md index 512f5a356..b58b629a1 100644 --- a/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md +++ b/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md @@ -9,17 +9,17 @@ title: "Representation monitoring via linear concept vectors creates a dual-use agent: theseus scope: causal sourcer: Xu et al. -related: - mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal - chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability - multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent - linear-probe-accuracy-scales-with-model-size-power-law -supports: - "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together" -reweave_edges: - "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together|supports|2026-04-21" +related: ["mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal", "chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent", "linear-probe-accuracy-scales-with-model-size-power-law", "anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks"] +supports: ["Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface
scale together"] +reweave_edges: ["Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together|supports|2026-04-21"] --- # Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success -Xu et al. introduce SCAV (Steering Concept Activation Vectors), which identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress those activations. The framework achieved an average attack success rate of 99.14% across seven open-source LLMs using keyword-matching evaluation. Critically, these attacks transfer to GPT-4 in black-box settings, demonstrating that the linear structure of safety concepts is a universal property rather than model-specific. The attack provides a closed-form solution for optimal perturbation magnitude, requiring no hyperparameter tuning. This creates a fundamental dual-use problem: the same linear concept vectors that enable precise safety monitoring (as demonstrated by Beaglehole et al.) also create a precision targeting map for adversarial attacks. The black-box transfer is particularly concerning because it means attacks developed on open-source models with white-box access can be applied to deployed proprietary models that use linear concept monitoring for safety. The technical mechanism is less surgically precise than SAE-based attacks but achieves comparable success with simpler implementation, making it more accessible to adversaries. \ No newline at end of file +Xu et al. introduce SCAV (Steering Concept Activation Vectors), which identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress those activations. 
The framework achieved an average attack success rate of 99.14% across seven open-source LLMs using keyword-matching evaluation. Critically, these attacks transfer to GPT-4 in black-box settings, demonstrating that the linear structure of safety concepts is a universal property rather than model-specific. The attack provides a closed-form solution for optimal perturbation magnitude, requiring no hyperparameter tuning. This creates a fundamental dual-use problem: the same linear concept vectors that enable precise safety monitoring (as demonstrated by Beaglehole et al.) also create a precision targeting map for adversarial attacks. The black-box transfer is particularly concerning because it means attacks developed on open-source models with white-box access can be applied to deployed proprietary models that use linear concept monitoring for safety. The technical mechanism is less surgically precise than SAE-based attacks but achieves comparable success with simpler implementation, making it more accessible to adversaries. + +## Extending Evidence + +**Source:** Theseus synthetic analysis combining Nordby et al. and Xu et al. SCAV + +Multi-layer ensemble probes do not escape the dual-use attack surface identified for single-layer probes. With white-box access, SCAV can be generalized to compute concept directions at each monitored layer and construct a single perturbation suppressing all simultaneously. This is a higher-dimensional optimization requiring more computation and data, but is structurally feasible by the same mechanism. Open-weights models (Llama, Mistral, Falcon) remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity. 
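Xu et al.'s closed-form perturbation is not reproduced here, but the basic geometry of a concept-direction attack can be sketched with synthetic data: fit a difference-of-means concept direction, then move an activation just past the decision threshold along that direction. The data, margin, and threshold construction below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16

# Toy activations: "safe" and "harmful" instruction classes separated along one axis.
safe = rng.normal(0, 1, (200, d)); safe[:, 0] += 2.0
harmful = rng.normal(0, 1, (200, d)); harmful[:, 0] -= 2.0

# Difference-of-means concept direction, normalized to unit length.
w = safe.mean(axis=0) - harmful.mean(axis=0)
w /= np.linalg.norm(w)
# Decision threshold: projection of the midpoint between class means.
threshold = 0.5 * (safe.mean(axis=0) + harmful.mean(axis=0)) @ w

def perturb_to_safe(x, margin=1.0):
    # Closed form for a unit-norm direction: move along w just far enough
    # to land `margin` past the threshold (no hyperparameter search needed).
    gap = (threshold + margin) - x @ w
    return x + max(gap, 0.0) * w

x = harmful[0]
x_adv = perturb_to_safe(x)
```

Because `w` is unit-norm, the needed step length is simply the projection gap, which is the sense in which a linear concept monitor hands the attacker its own targeting map.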
diff --git a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md index 9bd2f48a0..eeddcea40 100644 --- a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md +++ b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md @@ -10,15 +10,17 @@ agent: theseus scope: causal sourcer: Theseus related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]"] -supports: -- Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features -reweave_edges: -- Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features|supports|2026-04-12 -- Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17 -related: -- Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios +supports: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric 
patterns across reasoning steps are substantially harder to surgically remove than atomic features"] +reweave_edges: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features|supports|2026-04-12", "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17"] +related: ["Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios"] --- # Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters -Weight-space alignment geometry research (2602.15799) establishes that alignment concentrates in low-dimensional subspaces with sharp curvature, producing quartic scaling of alignment loss (∝ t⁴). This geometric concentration in weight space causally determines inference dynamics, producing characteristic, tightly-clustered activation trajectory patterns during aligned reasoning. The tight clustering enhances trajectory monitoring effectiveness relative to sparse feature (SAE) or single-direction (SCAV) monitoring because the signal-to-noise ratio for distinguishing aligned vs. non-aligned reasoning is amplified by geometric concentration. However, this same tight clustering creates a precision target for adversarial training. An adversary can train models to produce deceptive reasoning along activation trajectories that geometrically match the 'aligned' cluster.
This attack is harder than SAE feature removal or SCAV direction suppression (which require only identifying and suppressing specific features or directions) because it requires perturbing the entire computation path. The adversary needs either white-box access to the monitor or extensive behavioral feedback to construct a proxy for the trajectory monitor's geometric criteria. Attack complexity scales with trajectory cluster dimensionality. The defense advantage is real but conditional on adversary capability—a sufficiently capable model could construct training data achieving trajectory-level evasion. The monitoring advantage is that trajectory perturbation is harder; the dual-use risk is that the same geometric properties making monitoring possible also make adversarial targeting more tractable once the cluster is identified. \ No newline at end of file +Weight-space alignment geometry research (2602.15799) establishes that alignment concentrates in low-dimensional subspaces with sharp curvature, producing quartic scaling of alignment loss (∝ t⁴). This geometric concentration in weight space causally determines inference dynamics, producing characteristic, tightly-clustered activation trajectory patterns during aligned reasoning. The tight clustering enhances trajectory monitoring effectiveness relative to sparse feature (SAE) or single-direction (SCAV) monitoring because the signal-to-noise ratio for distinguishing aligned vs. non-aligned reasoning is amplified by geometric concentration. However, this same tight clustering creates a precision target for adversarial training. An adversary can train models to produce deceptive reasoning along activation trajectories that geometrically match the 'aligned' cluster. This attack is harder than SAE feature removal or SCAV direction suppression (which require only identifying and suppressing specific features or directions) because it requires perturbing the entire computation path. 
The adversary needs either white-box access to the monitor or extensive behavioral feedback to construct a proxy for the trajectory monitor's geometric criteria. Attack complexity scales with trajectory cluster dimensionality. The defense advantage is real but conditional on adversary capability—a sufficiently capable model could construct training data achieving trajectory-level evasion. The monitoring advantage is that trajectory perturbation is harder; the dual-use risk is that the same geometric properties making monitoring possible also make adversarial targeting more tractable once the cluster is identified. + +## Extending Evidence + +**Source:** Theseus synthesis addressing Sessions 29-30 open question + +The dual-use finding now extends to multi-layer ensemble monitoring with deployment-context qualification: open-weights models face white-box multi-layer SCAV attacks that architectural improvements cannot prevent, while closed-source models may gain genuine black-box protection if concept direction rotation patterns are model-specific rather than universal. The monitoring precision hierarchy holds across all levels, but the severity of dual-use risk depends on whether attackers have white-box or black-box access.
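As a final illustration of the monitoring side, here is a toy distance-to-cluster trajectory monitor under assumed geometry: aligned reasoning produces tightly clustered activation trajectories, and anything far from the calibrated cluster is flagged. Real trajectory monitors would use richer geometric criteria than a single centroid distance; every number below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
steps, d = 8, 12

# Hypothetical data: aligned reasoning trajectories cluster tightly around a
# shared pattern; off-cluster trajectories are diffuse.
aligned = rng.normal(0, 0.3, (100, steps, d)) + 1.0
other = rng.normal(0, 1.5, (20, steps, d))

# Calibrate a threshold on distances within the aligned cluster.
flat = aligned.reshape(100, -1)
mu = flat.mean(axis=0)
calib = np.linalg.norm(flat - mu, axis=1)
threshold = np.percentile(calib, 99)  # flag anything beyond the 99th percentile

def flagged(traj):
    # Flag a trajectory whose flattened distance to the cluster mean is too large.
    return np.linalg.norm(traj.reshape(-1) - mu) > threshold

flags = [flagged(t) for t in other]
```

The tighter the calibration distances, the lower the threshold and the stronger the monitor's signal; but that same low-dimensional, tightly specified target is exactly what an adversary would train deceptive trajectories to match.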