Teleo Agents 8c2fdbb44a theseus: extract claims from 2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks

- Source: inbox/queue/2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-26 00:29:24 +00:00


type: claim
domain: ai-alignment
description: If deception direction rotation patterns across layers are model-specific rather than universal, closed-source models gain genuine protection that open-weights models cannot achieve
confidence: speculative
source: Theseus synthetic analysis identifying untested empirical question
created: 2026-04-22
title: Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness
agent: theseus
sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
scope: structural
sourcer: Theseus
supports:
- multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks
- representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface
- anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks
- rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility

Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness

Whether black-box multi-layer SCAV attacks are feasible depends on whether the rotation pattern of concept directions across layers is universal across model families or specific to each model. Single-layer SCAV achieved black-box transfer to GPT-4 because concept direction universality (confirmed by Beaglehole et al. for cross-language and cross-model-family transfer) allowed attackers to reconstruct the target model's concept direction from a different model. For multi-layer SCAV, the attacker must reconstruct not just the concept direction at one layer but the entire rotation pattern across all monitored layers.

Two competing arguments exist:

(1) Rotation universality: If the underlying geometry of safety representations is universal enough to enable cross-language transfer (Beaglehole et al.), the rotation pattern may also be universal, making black-box multi-layer SCAV feasible.

(2) Rotation specificity: Models differ in architecture and training (transformer depth, attention head count, MLP width, pre-training data), and these differences produce different residual stream dynamics. The concept direction at any single layer is a projection of a universal concept onto a model-specific representational basis, and the rotation across layers depends on how that basis evolves from layer to layer, which may not be universal.

This is a testable empirical question with no published results. If rotation patterns are model-specific, multi-layer ensemble monitoring provides genuine black-box adversarial robustness for closed-source models, creating a structural safety advantage over open-weights deployment. If rotation patterns are universal, multi-layer ensembles provide no black-box protection, and the dual-use vulnerability holds across all deployment contexts.
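One way to make the rotation-pattern question concrete is sketched below. This is an illustrative assumption, not a method taken from the source: it estimates a per-layer concept direction as a difference of class means, summarizes the "rotation pattern" as the sequence of angles between directions at consecutive layers, and compares two such profiles by correlation. The function names (`concept_directions`, `rotation_profile`, `profile_similarity`), the difference-of-means estimator, and the synthetic activations are all hypothetical.

```python
import numpy as np

def concept_directions(pos_acts, neg_acts):
    """Difference-of-means concept direction at each layer (an assumed estimator).

    pos_acts, neg_acts: arrays of shape (n_layers, n_samples, d_model).
    Returns unit vectors of shape (n_layers, d_model).
    """
    diff = pos_acts.mean(axis=1) - neg_acts.mean(axis=1)
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)

def rotation_profile(directions):
    """Angle in radians between unit concept directions at consecutive layers."""
    cos = np.sum(directions[:-1] * directions[1:], axis=1)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def profile_similarity(profile_a, profile_b):
    """Pearson correlation between two rotation profiles of equal length."""
    return float(np.corrcoef(profile_a, profile_b)[0, 1])

# Synthetic stand-in for residual stream activations (not real model data).
rng = np.random.default_rng(0)
n_layers, n_samples, d_model = 4, 64, 8
pos = rng.normal(size=(n_layers, n_samples, d_model)) + 1.0
neg = rng.normal(size=(n_layers, n_samples, d_model))

dirs = concept_directions(pos, neg)
profile = rotation_profile(dirs)  # one angle per pair of adjacent layers
```

The angles depend only on inner products, so the profile is unchanged by an orthogonal change of basis. That makes it a candidate quantity for comparing models whose representational bases differ, under the further assumption that the two models are probed at the same number of monitored layers.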

Extending Evidence

Source: Schnoor et al. 2025, arXiv 2509.22755

Theoretical analysis from the XAI literature shows that CAVs (Concept Activation Vectors) are fundamentally sensitive to the choice of non-concept distribution, the set of negative examples the concept is contrasted against (Schnoor et al., arXiv 2509.22755). Since non-concept distributions necessarily differ across model architectures and training regimes, this provides theoretical grounding for why rotation patterns extracted via SCAV would fail to transfer across model families: the concept vectors themselves are unstable under the distributional shifts inherent to cross-architecture application.
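The dependence on the non-concept distribution can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not the construction analyzed by Schnoor et al.: it uses a difference-of-means vector as the CAV (the helper `mean_difference_cav` is hypothetical) and synthetic Gaussian activations, and it holds the concept samples fixed while swapping the non-concept samples.

```python
import numpy as np

def mean_difference_cav(concept_acts, non_concept_acts):
    """Unit vector from the non-concept mean to the concept mean.

    A simplified stand-in for a CAV; both inputs have shape (n_samples, d).
    """
    diff = concept_acts.mean(axis=0) - non_concept_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

rng = np.random.default_rng(0)
d = 16
basis = np.eye(d)

# Fixed concept samples, shifted along axis 0.
concept = rng.normal(size=(500, d)) + 2.0 * basis[0]

# Two hypothetical non-concept distributions, shifted in opposite
# directions along axis 1.
non_concept_a = rng.normal(size=(500, d)) + 2.0 * basis[1]
non_concept_b = rng.normal(size=(500, d)) - 2.0 * basis[1]

cav_a = mean_difference_cav(concept, non_concept_a)
cav_b = mean_difference_cav(concept, non_concept_b)

# Cosine similarity between the two CAVs for the same concept samples.
cosine = float(cav_a @ cav_b)
```

In this toy setup the two vectors point in clearly different directions even though the concept samples are identical, because each vector also encodes where the negatives sit. Whether real cross-architecture shifts are large enough to break SCAV transfer remains the open empirical question.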
