From 5efb14878bd9ef1aeadf300830abb249a5a7bdbd Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Wed, 22 Apr 2026 02:15:11 +0000 Subject: [PATCH] theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis - Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md - Domain: ai-alignment - Claims: 1, Entities: 0 - Enrichments: 4 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus --- ...utperform-single-layer-by-29-78-percent.md | 7 +++++++ ...vectors-creates-dual-use-attack-surface.md | 7 +++++++ ...ut-creating-adversarial-attack-surfaces.md | 15 ++++++++++----- ...-black-box-multi-layer-scav-feasibility.md | 19 +++++++++++++++++++ ...oring-dual-edge-geometric-concentration.md | 7 +++++++ 5 files changed, 50 insertions(+), 5 deletions(-) create mode 100644 domains/ai-alignment/rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility.md diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md index a027733f5..0179a0dc0 100644 --- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md +++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md @@ -23,3 +23,10 @@ Nordby et al. found that multi-layer ensemble probes improved AUROC by 29 percen **Source:** Theseus synthesis of Nordby et al. (arXiv 2604.13386, April 2026) The 29-78% AUROC improvement occurs because deception-relevant concept directions rotate across layers rather than remaining fixed in the residual stream. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure. 
However, this improvement is measured on clean data without adversarial conditions—adversarial robustness testing is entirely absent from Nordby et al.'s scope. + + +## Challenging Evidence + +**Source:** Theseus synthetic analysis (2026-04-22) + +Nordby et al.'s clean-data accuracy improvements do not translate to adversarial robustness. White-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously through higher-dimensional optimization. Open-weights models gain no structural protection from multi-layer ensembles against adversarially sophisticated attackers with model access. diff --git a/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md b/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md index b58b629a1..1333462ca 100644 --- a/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md +++ b/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md @@ -23,3 +23,10 @@ Xu et al. introduce SCAV (Steering Concept Activation Vectors), which identifies **Source:** Theseus synthetic analysis combining Nordby et al. and Xu et al.'s SCAV. Multi-layer ensemble probes do not escape the dual-use attack surface identified for single-layer probes. With white-box access, SCAV can be generalized to compute concept directions at each monitored layer and construct a single perturbation suppressing all simultaneously. This is a higher-dimensional optimization requiring more computation and data, but is structurally feasible by the same mechanism. Open-weights models (Llama, Mistral, Falcon) remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity.
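The generalized attack described here reduces, in a linearized toy setting, to a stacked least-squares problem: one constraint per monitored layer, solved jointly for a single perturbation. A minimal sketch with synthetic data; the per-layer Jacobians `J`, the dimensions, and the linear response model are illustrative assumptions, not code from Nordby et al. or Xu et al.:

```python
# Toy linearization of white-box multi-layer SCAV: assume each monitored
# layer l has a known probe direction v[l], clean activation h[l], and a
# (toy) Jacobian J[l] mapping an embedding perturbation delta to an
# activation change. Suppressing every probe at once means satisfying
# (v[l] @ J[l]) delta = -(v[l] @ h[l]) for all layers simultaneously.
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_act, n_layers = 64, 32, 4

h = rng.normal(size=(n_layers, d_act))           # clean activations per layer
v = rng.normal(size=(n_layers, d_act))           # probe (concept) directions
v /= np.linalg.norm(v, axis=1, keepdims=True)
J = rng.normal(size=(n_layers, d_act, d_embed))  # toy per-layer Jacobians

# One linear constraint per monitored layer, stacked into a single system.
A = np.stack([v[l] @ J[l] for l in range(n_layers)])  # (n_layers, d_embed)
b = -np.einsum("ld,ld->l", v, h)                      # cancel each probe score
delta, *_ = np.linalg.lstsq(A, b, rcond=None)         # minimum-norm perturbation

# With more embedding dimensions than monitored layers, the system is
# underdetermined and every layer's probe score is driven to ~0 at once.
scores = np.einsum("ld,ld->l", v, h + np.einsum("lde,e->ld", J, delta))
print(np.allclose(scores, 0.0, atol=1e-8))  # prints True
```

The point of the sketch is the claim in the text: adding monitored layers raises the number of constraints, not the feasibility, as long as the attacker's perturbation space has more degrees of freedom than there are probes.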
+ + +## Extending Evidence + +**Source:** Theseus synthetic analysis (2026-04-22) + +Multi-layer ensemble architectures do not eliminate the fundamental attack surface in white-box settings. White-box multi-layer SCAV generalizes the single-layer attack by computing concept directions at each monitored layer and constructing perturbations that suppress all simultaneously. The attack cost increases but the structural vulnerability remains. diff --git a/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md b/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md index f89de90c3..9fc0dc718 100644 --- a/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md +++ b/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md @@ -10,12 +10,17 @@ agent: theseus scope: functional sourcer: Jack Lindsey, Adria Garriga-Alonso (Anthropic) related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]"] -supports: -- Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters -reweave_edges: -- Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters|supports|2026-04-12 +supports: ["Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training 
that matches monitored trajectory clusters"] +reweave_edges: ["Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters|supports|2026-04-12"] +related: ["representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces", "trajectory-monitoring-dual-edge-geometric-concentration", "interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment", "adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing"] --- # Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features -The method characterizes alignment-relevant properties as geometric patterns in activation space — the curvature, divergence rate, and trajectory shape of internal representations across reasoning steps — rather than identifying specific neurons or SAE features. Empirical findings show that deceptive alignment produces measurably distinct trajectory geometries from sincere alignment: the geometry of how a model reasons to an answer differs from the geometry of how it constructs a superficially similar answer through deception. Critically, trajectory geometry is substantially harder to target for adversarial removal than feature geometry because there is no clean atomic feature to suppress; the entire trajectory must be reshaped. This creates an asymmetry: defenders can detect geometric signatures of misalignment without creating a targeting map for adversaries. 
This is the operationalization of Direction B from the SAE dual-use branching point — read-only interpretability that provides alignment signals without enabling surgical feature removal. The limitation is that geometric probing cannot identify which specific behaviors are deceptive, only that 'this trajectory looks like deception.' \ No newline at end of file +The method characterizes alignment-relevant properties as geometric patterns in activation space — the curvature, divergence rate, and trajectory shape of internal representations across reasoning steps — rather than identifying specific neurons or SAE features. Empirical findings show that deceptive alignment produces measurably distinct trajectory geometries from sincere alignment: the geometry of how a model reasons to an answer differs from the geometry of how it constructs a superficially similar answer through deception. Critically, trajectory geometry is substantially harder to target for adversarial removal than feature geometry because there is no clean atomic feature to suppress; the entire trajectory must be reshaped. This creates an asymmetry: defenders can detect geometric signatures of misalignment without creating a targeting map for adversaries. This is the operationalization of Direction B from the SAE dual-use branching point — read-only interpretability that provides alignment signals without enabling surgical feature removal. The limitation is that geometric probing cannot identify which specific behaviors are deceptive, only that 'this trajectory looks like deception.' + +## Challenging Evidence + +**Source:** Theseus synthetic analysis (2026-04-22) + +Trajectory geometry monitoring does create adversarial attack surfaces through multi-layer SCAV generalization. The claim that geometric approaches avoid attack surface creation is incorrect for white-box adversaries with access to model weights. 
The claim may need qualification to 'black-box adversaries only, contingent on rotation pattern specificity.' diff --git a/domains/ai-alignment/rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility.md b/domains/ai-alignment/rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility.md new file mode 100644 index 000000000..1eb593b2a --- /dev/null +++ b/domains/ai-alignment/rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility.md @@ -0,0 +1,19 @@ +--- +type: claim +domain: ai-alignment +description: If deception direction rotation patterns across layers are model-specific rather than universal, closed-source models gain genuine protection that open-weights models cannot achieve +confidence: speculative +source: Theseus synthetic analysis identifying untested empirical question +created: 2026-04-22 +title: Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness +agent: theseus +sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md +scope: structural +sourcer: Theseus +supports: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks"] +related: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks"] +--- + +# Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness + +The feasibility of black-box multi-layer SCAV attacks depends on whether the rotation pattern of concept directions across layers is universal across model families or model-specific.
Single-layer SCAV achieved black-box transfer to GPT-4 because concept direction universality (confirmed by Beaglehole et al. for cross-language and cross-model-family transfer) allowed attackers to reconstruct the target model's concept direction from a different model. For multi-layer SCAV, the attacker must reconstruct not just the concept direction at one layer, but the entire rotation pattern across all monitored layers. Two competing arguments exist: (1) Rotation universality: If the underlying geometry of safety representations is universal enough to enable cross-language transfer (Beaglehole et al.), the rotation pattern may also be universal, making black-box multi-layer SCAV feasible. (2) Rotation specificity: Different model architectures (transformer depth, attention head count, MLP width, pre-training data) produce different residual stream dynamics. The concept direction at any single layer is a projection of a universal concept onto a model-specific representational basis, and the rotation across layers depends on how that basis evolves, which may not be universal. This is a testable empirical question with no published results. If rotation patterns are model-specific, multi-layer ensemble monitoring provides genuine black-box adversarial robustness for closed-source models, creating a structural safety advantage over open-weights deployment. If rotation patterns are universal, multi-layer ensembles provide no black-box protection, and the dual-use vulnerability holds across all deployment contexts. 
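The untested empirical question above can be operationalized: if one orthogonal change of basis aligns model A's per-layer concept directions with model B's at every layer simultaneously, the rotation pattern is shared (universal up to a basis change); if no single map fits all layers, it is model-specific. A hedged sketch on synthetic directions; the orthogonal-Procrustes test and all data are illustrative, not from any published result:

```python
# Synthetic test of rotation-pattern universality. Model A's concept
# direction drifts smoothly across layers; under the "universal"
# hypothesis, model B shows the same drift in a rotated basis (w = u @ Q).
import numpy as np

rng = np.random.default_rng(1)
n_layers, d = 12, 48

u = np.cumsum(0.3 * rng.normal(size=(n_layers, d)), axis=0) + rng.normal(size=d)
u /= np.linalg.norm(u, axis=1, keepdims=True)   # model A directions, per layer

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # unknown orthogonal basis change
w = u @ Q                                       # model B under universality

# Orthogonal Procrustes via SVD: R = argmin ||u @ R - w||_F over orthogonal R.
U, _, Vt = np.linalg.svd(u.T @ w)
R = U @ Vt
residual = np.linalg.norm(u @ R - w) / np.linalg.norm(w)
print(residual < 1e-8)  # prints True: one R explains every layer at once
```

Under the rotation-specificity hypothesis the same procedure would leave a large residual, because no single orthogonal map can reconcile layerwise drifts that evolve differently in the two models; the residual is the measurable quantity the open question asks about.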
diff --git a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md index eeddcea40..4d8f8c144 100644 --- a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md +++ b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md @@ -24,3 +24,10 @@ Weight-space alignment geometry research (2602.15799) establishes that alignment **Source:** Theseus synthesis addressing Sessions 29-30 open question The dual-use finding now extends to multi-layer ensemble monitoring with deployment-context qualification: open-weights models face white-box multi-layer SCAV attacks that architectural improvements cannot prevent, while closed-source models may gain genuine black-box protection if concept direction rotation patterns are model-specific rather than universal. The monitoring precision hierarchy holds across all levels, but the severity of dual-use risk depends on whether attackers have white-box or black-box access. + + +## Extending Evidence + +**Source:** Theseus synthetic analysis (2026-04-22) + +The dual-use vulnerability extends to multi-layer ensemble monitoring, not just single-layer probes. However, the severity is deployment-context-dependent: open-weights models (white-box adversaries) remain fully vulnerable, while closed-source models (black-box adversaries) may gain protection if rotation patterns are model-specific (untested assumption).
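The ensemble mechanism behind the patch's central 29-78% claim can also be sketched on synthetic data: when the concept direction rotates across layers, a single-layer probe reads one noisy projection, while averaging per-layer probe scores pools independent noise and recovers more of the signal. The setup, dimensions, and noise scale below are entirely illustrative, not Nordby et al.'s experiment:

```python
# Toy rotating-direction setup: the concept direction turns through a
# 2-plane across layers; each layer's activation carries the label y
# along that layer's direction plus isotropic noise.
import numpy as np

rng = np.random.default_rng(2)
n, d, n_layers = 4000, 16, 6
y = rng.integers(0, 2, size=n) * 2 - 1              # labels in {-1, +1}

theta = np.linspace(0, np.pi / 2, n_layers)
dirs = np.zeros((n_layers, d))
dirs[:, 0], dirs[:, 1] = np.cos(theta), np.sin(theta)  # rotating unit directions

acts = y[:, None, None] * dirs[None, :, :] + 2.0 * rng.normal(size=(n, n_layers, d))
scores = np.einsum("nld,ld->nl", acts, dirs)        # per-layer probe scores

def auroc(s, labels):
    # Rank-based AUROC (Mann-Whitney statistic), no ties with continuous scores.
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    pos = labels > 0
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

single = auroc(scores[:, 0], y)           # one layer's projection only
ensemble = auroc(scores.mean(axis=1), y)  # averaged across all layers
print(ensemble > single)  # prints True
```

Averaging the per-layer projections reduces the score noise variance roughly by the number of layers while preserving the signal, which is the "combine projections from multiple layers" mechanism the patch describes; the exact AUROC gain depends on the noise scale and layer count, so no specific percentage should be read off this toy.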