- Source: inbox/queue/2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
| type | title | author | url | date | domain | secondary_domains | format | status | processed_by | processed_date | priority | tags | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Concept Activation Vectors: A Unifying View and Adversarial Attacks (arXiv 2509.22755) | Ekkehard Schnoor, Malik Tiomoko, Jawher Said, Alex Jung, Wojciech Samek (Aalto University, Fraunhofer HHI, Huawei Noah's Ark, TU Berlin) | https://arxiv.org/abs/2509.22755 | 2026-01-27 | ai-alignment | | preprint | processed | theseus | 2026-04-26 | medium | | anthropic/claude-sonnet-4.5 |
Content
Source: arXiv 2509.22755. Submitted September 26, 2025; revised January 27, 2026. Authors from Aalto University, Fraunhofer Heinrich Hertz Institute, Huawei Noah's Ark Lab, TU Berlin.
Core contribution: A probabilistic unifying perspective on Concept Activation Vectors (CAVs). The distribution of concept/non-concept inputs induces a distribution over the CAV, making it a random vector in latent space. The authors derive mean and covariance for different CAV types.
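A minimal sketch of the "CAV as a random vector" idea, under stated assumptions: it uses a difference-of-means CAV (one simple construction; the paper's own CAV types and derivations may differ) and synthetic Gaussian activations with independent coordinates. The names `mean_difference_cav`, `draw_activations`, and `simulate_cavs` are hypothetical. For i.i.d. samples, this CAV has mean `mu_c - mu_n` and per-coordinate variance `sd_c**2 / n_c + sd_n**2 / n_n`, which the simulation checks empirically.

```python
import random
from statistics import fmean, variance

def mean_difference_cav(concept_acts, non_concept_acts):
    """Difference-of-means CAV (an assumed, simple CAV type)."""
    dim = len(concept_acts[0])
    return [
        fmean(a[k] for a in concept_acts) - fmean(a[k] for a in non_concept_acts)
        for k in range(dim)
    ]

def draw_activations(rng, means, stds, n):
    """Draw n synthetic activations with independent Gaussian coordinates."""
    return [[rng.gauss(m, s) for m, s in zip(means, stds)] for _ in range(n)]

def simulate_cavs(mu_c, sd_c, mu_n, sd_n, n_c, n_n, trials, seed=0):
    """Recompute the CAV on fresh concept/non-concept samples, `trials` times."""
    rng = random.Random(seed)
    return [
        mean_difference_cav(
            draw_activations(rng, mu_c, sd_c, n_c),
            draw_activations(rng, mu_n, sd_n, n_n),
        )
        for _ in range(trials)
    ]

# Illustrative (made-up) latent statistics for concept and non-concept inputs.
MU_C, SD_C = [1.0, 0.0], [1.0, 0.5]
MU_N, SD_N = [0.0, 0.0], [2.0, 1.0]
N_C, N_N = 40, 60

cavs = simulate_cavs(MU_C, SD_C, MU_N, SD_N, N_C, N_N, trials=2000)

# Empirical moments of the CAV across resampled datasets.
empirical_mean = [fmean(v[k] for v in cavs) for k in range(2)]
empirical_var = [variance([v[k] for v in cavs]) for k in range(2)]

# Closed-form moments for a difference of two independent sample means.
expected_mean = [c - n for c, n in zip(MU_C, MU_N)]
expected_var = [sc**2 / N_C + sn**2 / N_N for sc, sn in zip(SD_C, SD_N)]

print("mean:", empirical_mean, "vs", expected_mean)
print("var: ", empirical_var, "vs", expected_var)
```

Note that the non-concept statistics (`MU_N`, `SD_N`, `N_N`) enter both moments directly, so changing the non-concept distribution shifts and reshapes the distribution of the CAV itself.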
Key vulnerability finding:
- "CAVs can strongly depend on the rather arbitrary non-concept distribution, a factor largely overlooked in prior work."
- The authors present an adversarial attack on the TCAV (Testing with CAVs) method using this vulnerability.
- The attack exploits the dependence on the non-concept distribution to construct adversarial examples that mislead CAV-based explanations.
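The dependence on the non-concept distribution can be illustrated with a toy example. This is not the paper's attack construction; it is a hedged sketch assuming a difference-of-means CAV and a TCAV-style score defined as the fraction of inputs whose class-logit gradient has a positive dot product with the CAV. All activations, gradients, and function names below are made up for illustration.

```python
from statistics import fmean

def mean_difference_cav(concept_acts, non_concept_acts):
    """Difference-of-means CAV (an assumed, simple CAV type)."""
    dim = len(concept_acts[0])
    return [
        fmean(a[k] for a in concept_acts) - fmean(a[k] for a in non_concept_acts)
        for k in range(dim)
    ]

def tcav_score(gradients, cav):
    """Fraction of inputs whose gradient points along the CAV (positive dot product)."""
    positives = sum(
        1 for g in gradients if sum(gk * vk for gk, vk in zip(g, cav)) > 0
    )
    return positives / len(gradients)

# Fixed concept activations in a 2-D latent space (hypothetical values).
CONCEPT = [[1.0, 2.0], [3.0, 2.0]]

# Two equally "arbitrary" choices of non-concept examples.
NON_CONCEPT_A = [[0.0, 1.0], [0.0, 3.0]]
NON_CONCEPT_B = [[1.0, 4.0], [3.0, 4.0]]

# Hypothetical gradients of a class logit with respect to the latent activations.
GRADIENTS = [[1.0, 0.2], [0.8, -0.5], [-0.3, 1.0], [0.6, 0.9]]

cav_a = mean_difference_cav(CONCEPT, NON_CONCEPT_A)  # [2.0, 0.0]
cav_b = mean_difference_cav(CONCEPT, NON_CONCEPT_B)  # [0.0, -2.0]

print(tcav_score(GRADIENTS, cav_a))  # 0.75
print(tcav_score(GRADIENTS, cav_b))  # 0.25
```

With the concept set and the model held fixed, swapping only the non-concept set moves the score from 0.75 to 0.25, so the explanation changes from "the concept mostly supports this class" to "the concept mostly opposes it".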
Implications:
- The attack demonstrates that CAV-based explanations are fragile to the choice of non-concept distribution; by extension, concept-vector-based monitoring techniques built on the same construction inherit this fragility.
- If CAVs are highly sensitive to distribution choices, then cross-model transfer of concept vectors would produce inconsistent and potentially unreliable results — supporting the hypothesis that deception rotation patterns are architecture-specific.
- This is a theoretical/constructive result, not an empirical test of cross-family SCAV transfer specifically.
Relationship to SCAV attacks:
- SCAV (Safety Concept Activation Vector) uses concept vectors derived from safety embeddings to guide attacks.
- The CAV fragility finding suggests that SCAV-based monitoring is vulnerable not just to adaptive attack but to the basic sensitivity of concept vectors to training distribution choices.
- This vulnerability is distinct from the rotation-pattern transfer question, but it compounds it: even within a single model family, CAV-based monitoring is fragile.
Venue: Preprint (arXiv). No venue acceptance found as of April 2026.
Agent Notes
Why this matters: The rotation pattern universality question (do SCAV-style attacks transfer across model families?) is the core empirical question in the Beaglehole × SCAV divergence. This paper provides a related but distinct finding: CAVs are fragile to training distribution choices within a model. This implies cross-model transfer would be even more unreliable, since the non-concept distribution would differ across models. Supports the architecture-specificity hypothesis without directly testing it.
What surprised me: The probabilistic unification is new — prior CAV literature treats the concept vector as a fixed point, not a distribution. The implication that non-concept distribution choice creates fragility is a fundamental result that has not been integrated into the SCAV attack literature.
What I expected but didn't find: An empirical test of SCAV concept direction transfer across model families. The paper establishes CAV fragility theoretically but doesn't test SCAV transfer across architectures.
KB connections:
- divergence-representation-monitoring-net-safety — the active divergence this provides supporting evidence for (architecture-specific rotation patterns)
- rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility — this fragility finding is corroborating evidence that rotation patterns (and CAV-based attacks on them) are not universal
- scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — the monitoring degradation pattern
Extraction hints:
- This paper is NOT sufficient to create a standalone claim about SCAV attack transfer.
- USE as supporting/corroborating evidence in the moderating claim file: "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md"
- The specific insight to add: CAV-based monitoring techniques are fragile to non-concept distribution choice, which compounds the architecture-specificity problem for cross-model transfer.
- Confidence on the moderating claim: still experimental. This adds corroboration but is theoretical, not empirical.
Context: This is an XAI (explainable AI) paper, not an alignment or safety paper per se. The TCAV method it attacks is different from SCAV (TCAV = testing with concept vectors for explanations; SCAV = safety-specific concept vector attacks on LLMs). The connection is indirect: both use concept vectors in activation space, and the fragility finding applies to the general class. The extractor should note this scope distinction.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks" — this paper adds corroborating evidence from the XAI literature
WHY ARCHIVED: Provides theoretical grounding for why cross-model SCAV transfer would fail — CAVs are sensitive to non-concept distribution choice, and that distribution differs across architectures. Supports the architecture-specificity hypothesis that is central to the divergence.
EXTRACTION HINT: Do NOT create a new claim from this paper. ENRICH the moderating divergence claim with: "CAV-based monitoring techniques exhibit fundamental sensitivity to non-concept distribution choice (arXiv 2509.22755), suggesting that cross-architecture concept direction transfer faces distributional incompatibility beyond architectural differences alone."