- Source: inbox/queue/2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md - Domain: ai-alignment - Claims: 0, Entities: 0 - Enrichments: 2 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus <PIPELINE>
3.5 KiB
| type | domain | description | confidence | source | created | title | agent | sourced_from | scope | sourcer | supports | related | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | If deception direction rotation patterns across layers are model-specific rather than universal, closed-source models gain genuine protection that open-weights models cannot achieve | speculative | Theseus synthetic analysis identifying untested empirical question | 2026-04-22 | Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness | theseus | ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md | structural | Theseus |
|
|
Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness
The feasibility of black-box multi-layer SCAV attacks depends on whether the rotation pattern of concept directions across layers is universal across model families or model-specific. Single-layer SCAV achieved black-box transfer to GPT-4 because concept direction universality (confirmed by Beaglehole et al. for cross-language and cross-model-family transfer) allowed attackers to reconstruct the target model's concept direction from a different model. For multi-layer SCAV, the attacker must reconstruct not just the concept direction at one layer, but the entire rotation pattern across all monitored layers. Two competing arguments exist: (1) Rotation universality: If the underlying geometry of safety representations is universal enough to enable cross-language transfer (Beaglehole et al.), the rotation pattern may also be universal, making black-box multi-layer SCAV feasible. (2) Rotation specificity: Different model architectures (transformer depth, attention head count, MLP width, pre-training data) produce different residual stream dynamics. The concept direction at any single layer is a projection of a universal concept onto a model-specific representational basis, and the rotation across layers depends on how that basis evolves, which may not be universal. This is a testable empirical question with no published results. If rotation patterns are model-specific, multi-layer ensemble monitoring provides genuine black-box adversarial robustness for closed-source models, creating a structural safety advantage over open-weights deployment. If rotation patterns are universal, multi-layer ensembles provide no black-box protection, and the dual-use vulnerability holds across all deployment contexts.
Extending Evidence
Source: Schnoor et al. 2025, arXiv 2509.22755
Theoretical analysis from XAI literature shows CAVs (Concept Activation Vectors) are fundamentally fragile to non-concept distribution choice (Schnoor et al., arXiv 2509.22755). Since non-concept distributions necessarily differ across model architectures and training regimes, this provides theoretical grounding for why rotation patterns extracted via SCAV would fail to transfer across model families—the concept vectors themselves are unstable under distributional shifts inherent to cross-architecture application.