teleo-codex/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
Teleo Agents f312c60b83 theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis
- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-22 01:50:30 +00:00


- type: claim
- domain: ai-alignment
- description: White-box SCAV attacks can suppress multi-layer ensembles by targeting all monitored layers simultaneously, but black-box attacks may fail if rotation patterns don't transfer across model families
- confidence: speculative
- source: Theseus synthetic analysis of Nordby et al. (arXiv 2604.13386), Xu et al. SCAV (arXiv 2404.12038), Beaglehole et al. (Science 391, 2026)
- created: 2026-04-22
- title: Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific, not universal
- agent: theseus
- sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- scope: structural
- sourcer: Theseus
- related:
  - anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks
  - trajectory-monitoring-dual-edge-geometric-concentration
  - representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface
  - multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent

Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific, not universal

Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. This architectural improvement does not, however, provide structural protection against white-box adversarial attacks. With access to model weights and activations (the standard condition for open-weights models such as Llama, Mistral, and Falcon), an attacker can generalize SCAV: compute the concept direction at each monitored layer, then construct a single perturbation that suppresses all of them simultaneously. This is a higher-dimensional optimization problem, but it is structurally feasible by the same mechanism as single-layer SCAV.

The critical unresolved question is whether black-box attacks transfer. Single-layer SCAV transferred to GPT-4 because the universality of concept directions allowed them to be reconstructed from other models; multi-layer black-box SCAV additionally requires that rotation patterns (how directions change across layers) be universal. Beaglehole et al. found that concept vectors transfer across languages and model families, suggesting the underlying geometry may be universal enough for rotation patterns to transfer as well. On the other hand, different architectures (depth, attention heads, MLP width, pre-training data) produce different residual-stream dynamics, and rotation may depend on model-specific evolution of the representational basis. No published work tests whether multi-layer rotation patterns transfer across model families. If they do not transfer, multi-layer ensembles provide genuine black-box protection for closed-source models; if they do transfer, multi-layer ensembles merely raise attack cost without escaping the dual-use structure.
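The white-box generalization of SCAV to multiple layers can be sketched as a single joint optimization: one perturbation, gradient-descended against the summed probe scores of every monitored layer at once. Everything below is illustrative, not Xu et al.'s implementation: the two-layer tanh "model", the random unit vectors standing in for trained concept directions, and the function names are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy residual-stream width

# Stand-in "model": two tanh layers playing the role of monitored layers.
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)

def layer_activations(x):
    h1 = np.tanh(W1 @ x)
    h2 = np.tanh(W2 @ h1)
    return [h1, h2]

# One random unit vector per layer as a stand-in concept direction
# (a real attack would derive these from trained probes, as in SCAV).
directions = [rng.normal(size=d) for _ in range(2)]
directions = [v / np.linalg.norm(v) for v in directions]

def probe_scores(x):
    """Linear probe score at every monitored layer."""
    return np.array([v @ h for v, h in zip(directions, layer_activations(x))])

def multilayer_attack(x, steps=300, lr=0.05, eps=1e-4):
    """Find ONE perturbation delta driving down the summed probe scores
    across all layers simultaneously (finite-difference gradient descent)."""
    delta = np.zeros(d)
    for _ in range(steps):
        base = probe_scores(x + delta).sum()
        grad = np.zeros(d)
        for i in range(d):
            step = np.zeros(d)
            step[i] = eps
            grad[i] = (probe_scores(x + delta + step).sum() - base) / eps
        delta -= lr * grad
    return delta

x = rng.normal(size=d)
delta = multilayer_attack(x)
before, after = probe_scores(x).sum(), probe_scores(x + delta).sum()
```

The point of the sketch is structural: nothing about the objective changes when more layers are monitored, only the dimensionality of the joint suppression problem.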
This creates a deployment-context-dependent safety verdict: open-weights models remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity, while closed-source models may gain genuine robustness if rotation patterns are model-specific.