teleo-codex/inbox/archive/ai-alignment/2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md

---
type: source
title: "Concept Activation Vectors: A Unifying View and Adversarial Attacks (arXiv 2509.22755)"
author: "Ekkehard Schnoor, Malik Tiomoko, Jawher Said, Alex Jung, Wojciech Samek (Aalto University, Fraunhofer HHI, Huawei Noah's Ark, TU Berlin)"
url: https://arxiv.org/abs/2509.22755
date: 2026-01-27
domain: ai-alignment
secondary_domains: []
format: preprint
status: processed
processed_by: theseus
processed_date: 2026-04-26
priority: medium
tags: [concept-activation-vectors, adversarial-attacks, representation-monitoring, cav-fragility, scav, b4-verification, rotation-patterns]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

**Source:** arXiv 2509.22755. Submitted September 26, 2025; revised January 27, 2026. Authors from Aalto University, Fraunhofer Heinrich Hertz Institute, Huawei Noah's Ark Lab, TU Berlin.

**Core contribution:** A probabilistic unifying perspective on Concept Activation Vectors (CAVs). The distribution of concept/non-concept inputs induces a distribution over the CAV, making it a random vector in latent space. The authors derive mean and covariance for different CAV types.
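
The "CAV as a random vector" view can be illustrated with a minimal sketch. It assumes the simplest CAV type, a difference of class means, and Gaussian synthetic activations. All names, dimensions, and parameters below are hypothetical and are not taken from the paper. For this CAV type with independent samples, the mean is the difference of the class means and the covariance is `cov_c / n_c + cov_n / n_n`; the simulation checks this by redrawing both sets many times.

```python
import numpy as np

def mean_difference_cav(concept_acts, non_concept_acts):
    """Difference-of-means CAV: mean concept activation minus mean non-concept activation."""
    return concept_acts.mean(axis=0) - non_concept_acts.mean(axis=0)

def sample_cavs(rng, mu_c, cov_c, mu_n, cov_n, n_c, n_n, n_trials):
    """Redraw both activation sets n_trials times and recompute the CAV each time."""
    cavs = np.empty((n_trials, mu_c.shape[0]))
    for t in range(n_trials):
        concept = rng.multivariate_normal(mu_c, cov_c, size=n_c)
        non_concept = rng.multivariate_normal(mu_n, cov_n, size=n_n)
        cavs[t] = mean_difference_cav(concept, non_concept)
    return cavs

# Hypothetical 2-D latent space with Gaussian activations (an assumption of this sketch).
rng = np.random.default_rng(0)
mu_c, mu_n = np.array([2.0, 0.0]), np.array([0.0, 1.0])
cov_c = np.array([[1.0, 0.3], [0.3, 0.5]])
cov_n = np.array([[2.0, -0.4], [-0.4, 1.5]])
n_c, n_n = 50, 30

cavs = sample_cavs(rng, mu_c, cov_c, mu_n, cov_n, n_c, n_n, n_trials=4000)
empirical_mean = cavs.mean(axis=0)
empirical_cov = np.cov(cavs, rowvar=False)

# Closed-form moments for the difference of two independent sample means.
expected_mean = mu_c - mu_n
expected_cov = cov_c / n_c + cov_n / n_n

print("mean gap:", np.abs(empirical_mean - expected_mean).max())
print("cov gap: ", np.abs(empirical_cov - expected_cov).max())
```

Note that the non-concept covariance `cov_n` enters the CAV covariance directly, so a noisier or smaller non-concept set makes the resulting CAV less stable.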

**Key vulnerability finding:**
- "CAVs can strongly depend on the rather arbitrary non-concept distribution, a factor largely overlooked in prior work."
- The authors present an adversarial attack on the TCAV (Testing with Concept Activation Vectors) method using this vulnerability.
- The attack exploits the dependence on the non-concept distribution to construct adversarial examples that mislead CAV-based explanations.
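
The dependence on the non-concept distribution can be sketched on toy data. This is not the paper's attack; it only shows that swapping the non-concept set, with the concept set and the model held fixed, can reverse a TCAV-style score. The toy logit, the difference-of-means CAV, and all data parameters are assumptions made for illustration.

```python
import numpy as np

def mean_difference_cav(concept_acts, non_concept_acts):
    """Unit-norm difference-of-means CAV (one simple CAV type, assumed here)."""
    v = concept_acts.mean(axis=0) - non_concept_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def toy_logit_grad(acts):
    """Gradient of a hypothetical class logit softplus(a0) - 0.5 * a1 w.r.t. the activations."""
    grad = np.empty_like(acts)
    grad[:, 0] = 1.0 / (1.0 + np.exp(-acts[:, 0]))  # d/da0 softplus(a0) = sigmoid(a0)
    grad[:, 1] = -0.5
    return grad

def tcav_score(cav, class_acts):
    """Fraction of class inputs whose logit increases along the CAV direction."""
    directional_derivs = toy_logit_grad(class_acts) @ cav
    return float(np.mean(directional_derivs > 0))

rng = np.random.default_rng(0)
# Same concept set in both runs; only the non-concept set changes.
concept = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(500, 2))
non_concept_a = rng.normal(loc=[0.0, 1.0], scale=0.5, size=(500, 2))
non_concept_b = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(500, 2))
class_acts = rng.normal(loc=[1.0, 0.0], scale=0.5, size=(200, 2))

cav_a = mean_difference_cav(concept, non_concept_a)
cav_b = mean_difference_cav(concept, non_concept_b)
score_a = tcav_score(cav_a, class_acts)
score_b = tcav_score(cav_b, class_acts)
cosine = float(cav_a @ cav_b)  # both CAVs are unit-norm

print(f"TCAV score with non-concept set A: {score_a:.2f}")
print(f"TCAV score with non-concept set B: {score_b:.2f}")
print(f"cosine(cav_a, cav_b): {cosine:.2f}")
```

In this toy setup the two non-concept sets sit on opposite sides of the concept set, so the two CAVs point in nearly opposite directions and the same concept appears to help the class in one run and hurt it in the other.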

**Implications:**
- The attack demonstrates that monitoring techniques based on concept vectors are sensitive to the choice of non-concept distribution.
- If CAVs are this sensitive to distribution choices within a single model, cross-model transfer of concept vectors would plausibly produce inconsistent and unreliable results. This is consistent with, but does not establish, the hypothesis that deception rotation patterns are architecture-specific.
- This is a theoretical/constructive result, not an empirical test of cross-family SCAV transfer specifically.

**Relationship to SCAV attacks:**
- SCAV (Safety Concept Activation Vector) uses concept vectors derived from safety embeddings to guide attacks.
- The CAV fragility finding suggests that SCAV-based monitoring is vulnerable not just to adaptive attack but to the basic sensitivity of concept vectors to training distribution choices.
- This is a different vulnerability from the rotation pattern transfer question, but compounds it: even within a model family, CAV-based monitoring is fragile.

**Venue:** Preprint (arXiv). No venue acceptance found as of April 2026.

## Agent Notes

**Why this matters:** The rotation pattern universality question (do SCAV-style attacks transfer across model families?) is the core empirical question in the Beaglehole × SCAV divergence. This paper provides a related but distinct finding: CAVs are fragile to training distribution choices within a single model. That suggests cross-model transfer would be even less reliable, since the activations of the non-concept set would also differ across models. The result supports the architecture-specificity hypothesis without directly testing it.

**What surprised me:** The probabilistic unification appears to be new: prior CAV literature treats the concept vector as a fixed vector, not a random one. The implication that non-concept distribution choice creates fragility is a fundamental result that does not yet appear to be integrated into the SCAV attack literature.

**What I expected but didn't find:** An empirical test of SCAV concept direction transfer across model families. The paper establishes CAV fragility theoretically but doesn't test SCAV transfer across architectures.

**KB connections:**
- divergence-representation-monitoring-net-safety — the active divergence this provides supporting evidence for (architecture-specific rotation patterns)
- [[rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility]] — this fragility finding is corroborating evidence that rotation patterns (and CAV-based attacks on them) are not universal
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — the monitoring degradation pattern

**Extraction hints:**
- This paper is NOT sufficient to create a standalone claim about SCAV attack transfer.
- USE as supporting/corroborating evidence in the moderating claim file: "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md"
- The specific insight to add: CAV-based monitoring techniques are fragile to non-concept distribution choice, which compounds the architecture-specificity problem for cross-model transfer.
- Confidence on the moderating claim: still experimental. This adds corroboration but is theoretical, not empirical.

**Context:** This is an XAI (explainable AI) paper, not an alignment or safety paper per se. The TCAV method it attacks is different from SCAV (TCAV = Testing with Concept Activation Vectors, an explanation method; SCAV = Safety Concept Activation Vector, used for attacks on LLMs). The connection is indirect: both use concept vectors in activation space, and the fragility finding applies to the general class. The extractor should note this scope distinction.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks" — this paper adds corroborating evidence from the XAI literature

WHY ARCHIVED: Provides theoretical grounding for why cross-model SCAV transfer may fail: CAVs are sensitive to non-concept distribution choice, and the activations that distribution induces differ across architectures. Supports, without empirically testing, the architecture-specificity hypothesis that is central to the divergence.

EXTRACTION HINT: Do NOT create a new claim from this paper. ENRICH the moderating divergence claim with: "CAV-based monitoring techniques exhibit fundamental sensitivity to non-concept distribution choice (arXiv 2509.22755), suggesting that cross-architecture concept direction transfer faces distributional incompatibility beyond architectural differences alone."