teleo-codex/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
Teleo Agents 29517bbd9a
theseus: extract claims from 2026-04-25-nordby-cross-model-limitations-family-specific-patterns
- Source: inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-30 03:26:39 +00:00


type: claim
domain: ai-alignment
description: White-box SCAV attacks can suppress multi-layer ensembles by targeting all monitored layers simultaneously, but black-box attacks may fail if rotation patterns don't transfer across model families
confidence: speculative
source: Theseus synthetic analysis of Nordby et al. (arXiv 2604.13386), Xu et al. SCAV (arXiv 2404.12038), Beaglehole et al. (Science 391, 2026)
created: 2026-04-22
title: Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal
agent: theseus
sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
scope: structural
sourcer: Theseus
related:
  - anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks
  - trajectory-monitoring-dual-edge-geometric-concentration
  - representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface
  - multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent
  - multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks
  - rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility

Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal

Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. This architectural improvement, however, does not provide structural protection against adversarial attacks in the white-box case. With access to model weights and activations (the standard condition for open-weights models such as Llama, Mistral, and Falcon), an attacker can generalize SCAV: compute the concept direction at each monitored layer and construct a single perturbation that suppresses all of them simultaneously. This is a higher-dimensional optimization problem, but it is structurally feasible by the same mechanism as single-layer SCAV.

The critical unresolved question is whether black-box attacks transfer. Single-layer SCAV transferred to GPT-4 because concept-direction universality allowed reconstruction from different models; multi-layer black-box SCAV additionally requires that rotation patterns (how directions change across layers) are universal. Beaglehole et al. found that concept vectors transfer across languages and model families, suggesting the underlying geometry may be universal enough to enable rotation-pattern transfer. However, different architectures (depth, attention heads, MLP width, pre-training data) produce different residual-stream dynamics, and rotation may depend on model-specific evolution of the representational basis. No published work tests whether multi-layer rotation patterns transfer across model families. If they do not transfer, multi-layer ensembles provide genuine black-box protection for closed-source models; if they do transfer, multi-layer ensembles merely raise attack cost without escaping the dual-use structure.

This yields a deployment-context-dependent safety verdict: open-weights models remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity, while closed-source models may gain genuine robustness if rotation patterns are model-specific.
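In a linear toy setting, the "one perturbation suppressing all monitored layers" step reduces to solving a small underdetermined linear system. A minimal numpy sketch, where the layer maps and probe directions are hypothetical stand-ins (a real SCAV-style attack would optimize through the actual nonlinear model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 4

# Hypothetical stand-ins: linear per-layer maps for the model and one
# unit-norm probe direction per monitored layer.
layer_maps = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

effective = []                       # v_l such that score_l(x) = v_l @ x
A = np.eye(d)
for W in layer_maps:
    A = W @ A                        # map from the input to layer l's activations
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)           # probe direction at layer l
    effective.append(A.T @ w)

V = np.stack(effective, axis=1)      # shape (d, n_layers)

def scores(x):
    """Probe score at every monitored layer for input activation x."""
    return V.T @ x

x0 = rng.standard_normal(d)          # the activation the attacker perturbs

# One perturbation that drives every layer's probe score to zero at once:
# solve the underdetermined system V^T delta = -V^T x0. lstsq returns the
# minimum-norm solution, i.e. the smallest such perturbation.
delta, *_ = np.linalg.lstsq(V.T, -scores(x0), rcond=None)

print(np.abs(scores(x0)).max())          # nonzero before the attack
print(np.abs(scores(x0 + delta)).max())  # essentially zero after
```

With n_layers well below the activation dimension, the system is underdetermined and an exact suppressing perturbation always exists; the nonlinear case replaces the closed-form solve with gradient-based optimization but keeps the same structure.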

Extending Evidence

Source: Apollo Research publication gap analysis, April 2026

The moderating claim that multi-layer ensemble probes provide black-box robustness depends on whether rotation patterns are architecture-specific or universal. As of April 2026, no cross-model-family probe transfer testing has been published, meaning the architecture-specificity assumption remains empirically untested. The absence of this testing after 14+ months suggests either: (a) cross-family transfer is known to fail internally and not worth publishing, (b) research agendas prioritize within-family deployment robustness, or (c) the experimental setup requires infrastructure not yet built.
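The untested experiment can at least be stated concretely. A minimal sketch, assuming a "rotation pattern" is summarized as the cosine similarity between the concept direction at successive layers; the per-layer directions here are random placeholders, whereas the real test would use probes trained on each family's activations:

```python
import numpy as np

rng = np.random.default_rng(1)

def rotation_pattern(directions):
    # Cosine similarity between the concept direction at successive layers:
    # one simple summary of how a direction rotates through the residual stream.
    u = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return np.array([u[i] @ u[i + 1] for i in range(len(u) - 1)])

# Hypothetical per-layer concept directions for two model families. In the
# actual experiment these would come from trained probes; random placeholders
# only fix the interface.
family_a = rng.standard_normal((12, 256))
family_b = rng.standard_normal((12, 256))

pat_a = rotation_pattern(family_a)
pat_b = rotation_pattern(family_b)

# If rotation patterns were universal, this correlation would be high across
# families; if they are model-specific, it would hover near zero.
print(np.corrcoef(pat_a, pat_b)[0, 1])
```

The infrastructure burden mentioned in (c) is visible even in this sketch: it presupposes matched probe-training data and comparable layer indexing across families with different depths.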

Extending Evidence

Source: Schnoor et al. 2025, arXiv 2509.22755

CAV-based monitoring techniques exhibit fundamental sensitivity to the choice of non-concept distribution (Schnoor et al., arXiv 2509.22755). The authors demonstrate that CAVs are random vectors whose distribution depends heavily on the arbitrary choice of non-concept examples used during training, and they present an adversarial attack on TCAV (Testing with Concept Activation Vectors) that exploits this distributional dependence. This suggests cross-architecture concept direction transfer faces distributional incompatibility beyond architectural differences alone: even within a single model, CAV reliability depends on training-distribution choices that would necessarily differ across model families.
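A toy illustration of that distributional dependence, using the simplest mean-difference CAV variant on synthetic data (Schnoor et al.'s construction is more general; everything below is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 128

# Synthetic activations: one fixed set of concept examples, and two different
# (equally arbitrary) choices of non-concept negatives.
concept = rng.standard_normal((200, d)) + 0.5          # concept cloud, shifted
shift = rng.standard_normal(d)                         # arbitrary offset
negatives_1 = rng.standard_normal((200, d))            # one negative distribution
negatives_2 = rng.standard_normal((200, d)) + shift    # another one

def cav(pos, neg):
    # Simplest CAV variant: normalized difference of class means.
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

v1 = cav(concept, negatives_1)
v2 = cav(concept, negatives_2)

# Identical concept examples, different negatives -> noticeably different
# concept direction (cosine similarity well below 1).
print(v1 @ v2)
```

Since the monitored "concept" never changed between the two fits, any disagreement between `v1` and `v2` is attributable purely to the negative-example choice, which is the dependence the attack exploits.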

Extending Evidence

Source: Nordby et al. arXiv 2604.13386, Limitations + empirical results

Nordby et al. provide indirect empirical evidence for architecture-specificity of rotation patterns through probe non-generalization. Family-specific probe performance patterns, dramatic variance in optimal layer positions across architectures, and the absence of any universal ensemble configuration all suggest that rotation patterns are architecture-dependent. The paper notes 'tens to hundreds of deception related directions' in larger models, indicating complex, architecture-specific geometry. This supports the hypothesis that black-box multi-layer SCAV attacks would fail against closed-source models with different architectures, strengthening the 'Nordby wins for closed-source deployments' resolution. However, the paper contains no adversarial robustness evaluation whatsoever; all results are on clean data. Confidence upgrades from speculative to experimental based on indirect evidence.
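The kind of probe non-generalization Nordby et al. report can be mimicked in a toy setting: a mean-difference probe fit on one "family" scores near chance on a second family whose activations express the same concept in a rotated basis. All geometries below are synthetic assumptions, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 64, 500

def auroc(scores, labels):
    # Rank-based AUROC: probability a positive example outscores a negative.
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).mean()

# Hypothetical setup: both families encode the same concept along some axis,
# but family B expresses it in a family-specific (randomly rotated) basis.
concept_axis = np.zeros(d); concept_axis[0] = 2.0
basis_b, _ = np.linalg.qr(rng.standard_normal((d, d)))   # family B's basis

def sample(n, rotated):
    labels = rng.integers(0, 2, n)
    x = rng.standard_normal((n, d)) + labels[:, None] * concept_axis
    return (x @ basis_b if rotated else x), labels

xa, ya = sample(n, rotated=False)      # "family A" activations
xb, yb = sample(n, rotated=True)       # "family B" activations

w = xa[ya == 1].mean(0) - xa[ya == 0].mean(0)   # probe fit on family A only

print(auroc(xa @ w, ya))   # in-family: well above chance
print(auroc(xb @ w, yb))   # cross-family: typically near chance
```

The gap between the two AUROCs is the signature the paper's family-specific results point at; an actual cross-family transfer study would measure it on real activations rather than this rotated-Gaussian assumption.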