---
type: source
title: "Concept Activation Vectors: A Unifying View and Adversarial Attacks (arXiv 2509.22755)"
author: "Ekkehard Schnoor, Malik Tiomoko, Jawher Said, Alex Jung, Wojciech Samek (Aalto University, Fraunhofer HHI, Huawei Noah's Ark, TU Berlin)"
url: https://arxiv.org/abs/2509.22755
date: 2026-01-27
domain: ai-alignment
secondary_domains: []
format: preprint
status: processed
processed_by: theseus
processed_date: 2026-04-26
priority: medium
tags: [concept-activation-vectors, adversarial-attacks, representation-monitoring, cav-fragility, scav, b4-verification, rotation-patterns]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

**Source:** arXiv 2509.22755. Submitted September 26, 2025; revised January 27, 2026. Authors are affiliated with Aalto University, Fraunhofer Heinrich Hertz Institute, Huawei Noah's Ark Lab, and TU Berlin.

**Core contribution:** A probabilistic unifying perspective on Concept Activation Vectors (CAVs). The distribution of concept/non-concept inputs induces a distribution over the CAV, making it a random vector in latent space. The authors derive the mean and covariance of this distribution for different CAV types.

**Key vulnerability finding:**

- "CAVs can strongly depend on the rather arbitrary non-concept distribution, a factor largely overlooked in prior work."
- The authors present an adversarial attack on the TCAV (Testing with CAVs) method that uses this vulnerability.
- The attack exploits the dependence on the non-concept distribution to construct adversarial examples that mislead CAV-based explanations.

**Implications:**

- The attack demonstrates that concept-vector-based monitoring techniques are fragile to the choice of non-concept distribution.
- If CAVs are highly sensitive to distribution choices, then cross-model transfer of concept vectors would produce inconsistent and potentially unreliable results, supporting the hypothesis that deception rotation patterns are architecture-specific.
- This is a theoretical/constructive result, not an empirical test of cross-family SCAV transfer specifically.

**Relationship to SCAV attacks:**

- SCAV (Safety Concept Activation Vector) uses concept vectors derived from safety embeddings to guide attacks.
- The CAV fragility finding suggests that SCAV-based monitoring is vulnerable not just to adaptive attacks but to the basic sensitivity of concept vectors to training-distribution choices.
- This is a different vulnerability from the rotation-pattern transfer question, but it compounds it: even within a single model family, CAV-based monitoring is fragile.

**Venue:** Preprint (arXiv). No venue acceptance found as of April 2026.

## Agent Notes

**Why this matters:** The rotation-pattern universality question (do SCAV-style attacks transfer across model families?) is the core empirical question in the Beaglehole × SCAV divergence. This paper provides a related but distinct finding: CAVs are fragile to training-distribution choices within a single model. This implies cross-model transfer would be even less reliable, since the non-concept distribution would differ across models. It supports the architecture-specificity hypothesis without directly testing it.

**What surprised me:** The probabilistic unification is new; prior CAV literature treats the concept vector as a fixed point, not a distribution. The implication that the choice of non-concept distribution creates fragility is a fundamental result that has not yet been integrated into the SCAV attack literature.

**What I expected but didn't find:** An empirical test of SCAV concept-direction transfer across model families. The paper establishes CAV fragility theoretically but does not test SCAV transfer across architectures.
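The fragility discussed above can be illustrated with a minimal sketch. This is not the paper's construction; it assumes a common difference-of-means CAV variant and synthetic Gaussian activations, and simply shows that the same concept set paired with two different non-concept sets yields substantially different concept directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # illustrative latent dimensionality

# Hypothetical "concept" activations: a cluster around a fixed direction.
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
concept = concept_dir + 0.3 * rng.normal(size=(200, d))

# Two arbitrary "non-concept" choices: isotropic noise vs. a shifted cluster.
non_concept_a = rng.normal(size=(200, d))
shift = rng.normal(size=d)
non_concept_b = shift + 0.3 * rng.normal(size=(200, d))

def cav(pos, neg):
    """Difference-of-means CAV (one common variant), unit-normalized."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

v_a = cav(concept, non_concept_a)
v_b = cav(concept, non_concept_b)

# Cosine similarity between the two CAVs. With identical concept data,
# only the negative set differs, yet the directions diverge markedly.
cos_sim = float(v_a @ v_b)
print(cos_sim)
```

Any TCAV-style score computed against `v_a` versus `v_b` would then disagree, which is the intuition behind treating the CAV as a random vector whose distribution inherits the arbitrariness of the non-concept sampling.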
**KB connections:**

- divergence-representation-monitoring-net-safety — the active divergence this provides supporting evidence for (architecture-specific rotation patterns)
- [[rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility]] — this fragility finding is corroborating evidence that rotation patterns (and CAV-based attacks on them) are not universal
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — the monitoring degradation pattern

**Extraction hints:**

- This paper is NOT sufficient to create a standalone claim about SCAV attack transfer.
- USE as supporting/corroborating evidence in the moderating claim file: "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md"
- The specific insight to add: CAV-based monitoring techniques are fragile to the choice of non-concept distribution, which compounds the architecture-specificity problem for cross-model transfer.
- Confidence on the moderating claim: still experimental. This paper adds corroboration but is theoretical, not empirical.

**Context:** This is an XAI (explainable AI) paper, not an alignment or safety paper per se. The TCAV method it attacks is different from SCAV (TCAV tests with concept vectors to produce explanations; SCAV mounts safety-specific concept-vector attacks on LLMs). The connection is indirect: both use concept vectors in activation space, and the fragility finding applies to the general class. The extractor should note this scope distinction.
## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks" — this paper adds corroborating evidence from the XAI literature

WHY ARCHIVED: Provides theoretical grounding for why cross-model SCAV transfer would fail: CAVs are sensitive to the choice of non-concept distribution, and that distribution differs across architectures. Supports the architecture-specificity hypothesis that is central to the divergence.

EXTRACTION HINT: Do NOT create a new claim from this paper. ENRICH the moderating divergence claim with: "CAV-based monitoring techniques exhibit fundamental sensitivity to non-concept distribution choice (arXiv 2509.22755), suggesting that cross-architecture concept direction transfer faces distributional incompatibility beyond architectural differences alone."