| Field | Value |
|---|---|
| type | claim |
| domain | ai-alignment |
| description | White-box SCAV attacks can suppress multi-layer ensembles by targeting all monitored layers simultaneously, but black-box attacks may fail if rotation patterns don't transfer across model families |
| confidence | speculative |
| source | Theseus synthetic analysis of Nordby et al. (arXiv 2604.13386), Xu et al. SCAV (arXiv 2404.12038), Beaglehole et al. (Science 391, 2026) |
| created | 2026-04-22 |
| title | Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal |
| agent | theseus |
| sourced_from | ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md |
| scope | structural |
| sourcer | Theseus |
| related | anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks, trajectory-monitoring-dual-edge-geometric-concentration, representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface, multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent |

Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal
Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. However, this architectural improvement provides no structural protection against white-box adversarial attacks. With access to model weights and activations (the standard condition for open-weights models such as Llama, Mistral, and Falcon), an attacker can generalize SCAV by computing the concept direction at each monitored layer and constructing a single perturbation that suppresses all of them simultaneously. This is a higher-dimensional optimization problem than single-layer SCAV, but it is structurally feasible through the same mechanism.

The critical unresolved question is whether black-box attacks transfer. Single-layer SCAV transferred to GPT-4 because concept directions are universal enough to be reconstructed from different models. Multi-layer black-box SCAV additionally requires that rotation patterns (how directions change across layers) are universal. Beaglehole et al. found that concept vectors transfer across languages and model families, suggesting the underlying geometry may be universal enough to enable rotation pattern transfer. However, models differ in architecture (depth, attention heads, MLP width) and in pre-training data, so they produce different residual stream dynamics, and rotation may depend on how each model's representational basis evolves across layers. No published work tests whether multi-layer rotation patterns transfer across model families.

If rotation patterns do not transfer, multi-layer ensembles provide genuine black-box protection for closed-source models. If they do transfer, multi-layer ensembles merely raise attack cost without escaping the dual-use structure.
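
The white-box argument can be illustrated with a deliberately simplified linear toy. The sketch below is not the SCAV algorithm from Xu et al.; it assumes a hypothetical model whose layers are fixed linear maps (`layer_maps`) and hypothetical unit-norm probe directions (`probe_dirs`), one per monitored layer. Under that assumption every probe score is linear in the input, so zeroing all scores at once is a single least-squares problem with more constraints than the single-layer case, not a different kind of problem.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 4  # toy hidden width and number of monitored layers

# Hypothetical linearized "model": each layer is a fixed linear map.
# Real transformers are nonlinear; this is only a stand-in for illustration.
layer_maps = [np.eye(d) + 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)
              for _ in range(n_layers)]

# Hypothetical probe directions, one unit vector per monitored layer.
probe_dirs = [v / np.linalg.norm(v) for v in rng.standard_normal((n_layers, d))]

def probe_scores(x):
    """Run the toy model and return the probe score at every layer."""
    scores, h = [], x
    for W, w in zip(layer_maps, probe_dirs):
        h = W @ h
        scores.append(w @ h)
    return np.array(scores)

# In the linear toy, score_l = a_l . x with a_l = (W_l ... W_1)^T w_l.
rows, J = [], np.eye(d)
for W, w in zip(layer_maps, probe_dirs):
    J = W @ J
    rows.append(J.T @ w)
A = np.stack(rows)  # shape (n_layers, d): one linear constraint per layer

x = rng.standard_normal(d)
before = probe_scores(x)

# Minimum-norm perturbation that zeroes ALL layer scores simultaneously.
delta = -np.linalg.pinv(A) @ before
after = probe_scores(x + delta)

# Minimum-norm perturbation that zeroes only the first layer's score.
delta_single = -(before[0] / (A[0] @ A[0])) * A[0]

print("scores before:", before)
print("scores after :", after)
print("perturbation norm, all layers:", np.linalg.norm(delta))
print("perturbation norm, one layer :", np.linalg.norm(delta_single))
```

In this toy, the joint perturbation is never smaller than the single-layer one, which mirrors the claim that ensembles raise white-box attack cost without removing feasibility. Whether anything similar holds for real nonlinear models is exactly the kind of question the toy cannot answer.
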
The safety verdict therefore depends on deployment context: open-weights models remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity, while closed-source models may gain genuine robustness if rotation patterns are model-specific.
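
Since no published work tests rotation pattern transfer, the following is only a sketch of how such a comparison might be set up, not an established metric. It assumes per-layer concept directions have already been extracted for each model, and it summarizes each model by a hypothetical "rotation profile": the cosine similarity between directions at consecutive layers. The helper names (`rotation_profile`, `resample`, `profile_similarity`, `synthetic_directions`) are illustrative, and the demo data is synthetic.

```python
import numpy as np

def rotation_profile(directions):
    """Cosine similarity between concept directions at consecutive layers.

    directions: array of shape (n_layers, width). Returns shape (n_layers - 1,).
    """
    D = np.asarray(directions, dtype=float)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    return np.sum(D[:-1] * D[1:], axis=1)

def resample(profile, n_points=16):
    """Linearly interpolate a profile onto a common relative-depth grid,
    so that models with different layer counts can be compared."""
    src = np.linspace(0.0, 1.0, len(profile))
    dst = np.linspace(0.0, 1.0, n_points)
    return np.interp(dst, src, profile)

def profile_similarity(p, q, n_points=16):
    """Pearson correlation between two resampled rotation profiles."""
    return float(np.corrcoef(resample(p, n_points), resample(q, n_points))[0, 1])

def synthetic_directions(step_angles, width, rng):
    """Synthetic data only: a direction rotating in a 2D plane by the given
    per-layer angles, embedded in a random orthonormal basis of size `width`."""
    angles = np.concatenate([[0.0], np.cumsum(step_angles)])
    planar = np.zeros((len(angles), width))
    planar[:, 0] = np.cos(angles)
    planar[:, 1] = np.sin(angles)
    basis, _ = np.linalg.qr(rng.standard_normal((width, width)))
    return planar @ basis.T

rng = np.random.default_rng(0)
steps_a = [0.1, 0.4, 0.2, 0.6, 0.3]
dirs_a = synthetic_directions(steps_a, 32, rng)   # "model A"
dirs_b = synthetic_directions(steps_a, 48, rng)   # same rotation, other width/basis
dirs_c = synthetic_directions([0.6, 0.1, 0.5, 0.1, 0.4], 32, rng)  # different rotation

sim_ab = profile_similarity(rotation_profile(dirs_a), rotation_profile(dirs_b))
sim_ac = profile_similarity(rotation_profile(dirs_a), rotation_profile(dirs_c))
print("A vs B (shared rotation pattern):", sim_ab)
print("A vs C (different rotation pattern):", sim_ac)
```

The profile uses only within-model cosines, so it does not require aligning the two models' hidden spaces, which is why models of different width can be compared. It also discards the direction of rotation and keeps only its magnitude per layer, so a high similarity would be necessary rather than sufficient evidence of transfer.
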