teleo-codex/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
Teleo Agents f312c60b83 theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis
- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-22 01:50:30 +00:00


- type: claim
- domain: ai-alignment
- description: White-box SCAV attacks can suppress multi-layer ensembles by targeting all monitored layers simultaneously, but black-box attacks may fail if rotation patterns don't transfer across model families
- confidence: speculative
- source: Theseus synthetic analysis of Nordby et al. (arXiv 2604.13386), Xu et al. SCAV (arXiv 2404.12038), Beaglehole et al. (Science 391, 2026)
- created: 2026-04-22
- title: Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific, not universal
- agent: theseus
- sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- scope: structural
- sourcer: Theseus
- related:
  - anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks
  - trajectory-monitoring-dual-edge-geometric-concentration
  - representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface
  - multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent

Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific, not universal

Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. This architectural improvement does not, however, provide structural protection against white-box adversarial attacks. With access to model weights and activations (the standard condition for open-weights models such as Llama, Mistral, and Falcon), an attacker can generalize SCAV: compute the concept direction at each monitored layer, then construct a single perturbation that suppresses all of them simultaneously. This is a higher-dimensional optimization problem, but it is structurally feasible by the same mechanism as single-layer SCAV.

The critical unresolved question is whether black-box attacks transfer. Single-layer SCAV transferred to GPT-4 because the universality of concept directions allowed them to be reconstructed from other models; multi-layer black-box SCAV additionally requires that rotation patterns (how directions change across layers) be universal. Beaglehole et al. found that concept vectors transfer across languages and model families, suggesting the underlying geometry may be universal enough for rotation patterns to transfer as well. On the other hand, different architectures (depth, attention heads, MLP width, pre-training data) produce different residual-stream dynamics, and rotation may depend on model-specific evolution of the representational basis. No published work tests whether multi-layer rotation patterns transfer across model families. If they do not transfer, multi-layer ensembles provide genuine black-box protection for closed-source models; if they do transfer, multi-layer ensembles merely raise attack cost without escaping the dual-use structure.
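The white-box generalization of SCAV to multiple layers can be sketched as a single joint optimization: one perturbation, gradient-descended against the summed probe scores of every monitored layer at once. Everything below is illustrative, not Xu et al.'s implementation: the two-layer tanh "model", the random unit vectors standing in for trained concept directions, and the function names are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy residual-stream width

# Stand-in "model": two tanh layers playing the role of monitored layers.
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)

def layer_activations(x):
    h1 = np.tanh(W1 @ x)
    h2 = np.tanh(W2 @ h1)
    return [h1, h2]

# One random unit vector per layer as a stand-in concept direction
# (a real attack would derive these from trained probes, as in SCAV).
directions = [rng.normal(size=d) for _ in range(2)]
directions = [v / np.linalg.norm(v) for v in directions]

def probe_scores(x):
    """Linear probe score at every monitored layer."""
    return np.array([v @ h for v, h in zip(directions, layer_activations(x))])

def multilayer_attack(x, steps=300, lr=0.05, eps=1e-4):
    """Find ONE perturbation delta driving down the summed probe scores
    across all layers simultaneously (finite-difference gradient descent)."""
    delta = np.zeros(d)
    for _ in range(steps):
        base = probe_scores(x + delta).sum()
        grad = np.zeros(d)
        for i in range(d):
            step = np.zeros(d)
            step[i] = eps
            grad[i] = (probe_scores(x + delta + step).sum() - base) / eps
        delta -= lr * grad
    return delta

x = rng.normal(size=d)
delta = multilayer_attack(x)
before, after = probe_scores(x).sum(), probe_scores(x + delta).sum()
```

The point of the sketch is structural: nothing about the objective changes when more layers are monitored, only the dimensionality of the joint suppression problem.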
This creates a deployment-context-dependent safety verdict: open-weights models remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity, while closed-source models may gain genuine robustness if rotation patterns are model-specific.