- Source: inbox/queue/2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md - Domain: ai-alignment - Claims: 0, Entities: 0 - Enrichments: 2 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus <PIPELINE>
---
type: source
title: "Concept Activation Vectors: A Unifying View and Adversarial Attacks (arXiv 2509.22755)"
author: "Ekkehard Schnoor, Malik Tiomoko, Jawher Said, Alex Jung, Wojciech Samek (Aalto University, Fraunhofer HHI, Huawei Noah's Ark, TU Berlin)"
url: https://arxiv.org/abs/2509.22755
date: 2026-01-27
domain: ai-alignment
secondary_domains: []
format: preprint
status: processed
processed_by: theseus
processed_date: 2026-04-26
priority: medium
tags: [concept-activation-vectors, adversarial-attacks, representation-monitoring, cav-fragility, scav, b4-verification, rotation-patterns]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
**Source:** arXiv 2509.22755. Submitted September 26, 2025; revised January 27, 2026. Authors are from Aalto University, Fraunhofer Heinrich Hertz Institute, Huawei Noah's Ark Lab, and TU Berlin.

**Core contribution:** A unifying probabilistic perspective on Concept Activation Vectors (CAVs). Because a CAV is estimated from samples of concept and non-concept inputs, the distribution of those inputs induces a distribution over the CAV, making it a random vector in latent space. The authors derive the mean and covariance for different CAV types.

**Key vulnerability finding:**
- "CAVs can strongly depend on the rather arbitrary non-concept distribution, a factor largely overlooked in prior work."
- The authors present an adversarial attack on the TCAV (Testing with CAVs) method that uses this vulnerability.
- The attack exploits the dependence on the non-concept distribution to construct adversarial examples that mislead CAV-based explanations.

**Implications:**
- The attack demonstrates that concept-vector-based monitoring techniques are fragile with respect to the choice of non-concept distribution.
- If CAVs are this sensitive to distribution choices, cross-model transfer of concept vectors would likely produce inconsistent and potentially unreliable results. This is consistent with the hypothesis that deception rotation patterns are architecture-specific.
- This is a theoretical/constructive result, not an empirical test of cross-family SCAV transfer specifically.

**Relationship to SCAV attacks:**
- SCAV (Safety Concept Activation Vector) uses concept vectors derived from safety embeddings to guide attacks.
- The CAV fragility finding suggests that SCAV-based monitoring is vulnerable not only to adaptive attacks but also to the basic sensitivity of concept vectors to training distribution choices.
- This is a different vulnerability from the rotation pattern transfer question, but it compounds it: even within a model family, CAV-based monitoring is fragile.

**Venue:** Preprint (arXiv). No venue acceptance found as of April 2026.
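The dependence on the non-concept distribution described under "Core contribution" and "Key vulnerability finding" can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's construction or its attack: activations are synthetic 2-D Gaussians, the CAV is the normalized difference of class means (one common CAV type), the "gradients" are random stand-ins for logit gradients, and all function names are hypothetical. The sketch only shows that holding the concept samples fixed while swapping the negative set can rotate the CAV and flip a TCAV-style score.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def difference_of_means_cav(concept_acts, non_concept_acts):
    """Unit-norm difference-of-means CAV: mean(concept) - mean(non-concept)."""
    mu_c = mean_vector(concept_acts)
    mu_n = mean_vector(non_concept_acts)
    diff = [c - n for c, n in zip(mu_c, mu_n)]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def sample_gaussian(rng, mean, n, std=0.1):
    """Draw n synthetic activation vectors around `mean` (toy data)."""
    return [[rng.gauss(m, std) for m in mean] for _ in range(n)]


def tcav_style_score(gradients, cav):
    """Fraction of inputs whose gradient has a positive component along the CAV."""
    return sum(1 for g in gradients if dot(g, cav) > 0) / len(gradients)


rng = random.Random(0)

# Same concept samples, two different (equally "arbitrary") negative sets.
concept = sample_gaussian(rng, [1.0, 0.0], 200)
non_concept_a = sample_gaussian(rng, [0.0, 0.0], 200)
non_concept_b = sample_gaussian(rng, [1.0, 1.0], 200)

cav_a = difference_of_means_cav(concept, non_concept_a)  # points roughly along +x
cav_b = difference_of_means_cav(concept, non_concept_b)  # points roughly along -y

# The two CAVs for the "same" concept are nearly orthogonal.
similarity = cosine(cav_a, cav_b)

# Toy stand-in for per-input logit gradients; not derived from a real model.
gradients = sample_gaussian(rng, [0.5, 0.5], 200)
score_a = tcav_style_score(gradients, cav_a)
score_b = tcav_style_score(gradients, cav_b)

print(f"cosine(cav_a, cav_b) = {similarity:.3f}")
print(f"TCAV-style score with negatives A: {score_a:.2f}")
print(f"TCAV-style score with negatives B: {score_b:.2f}")
```

In this toy setup the explanation for the same concept and the same model gradients goes from "strongly positive" to "strongly negative" purely because the negative set changed. A real CAV pipeline would use model activations and often a linear classifier rather than a difference of means, so the size of the effect there is an open question this sketch does not answer.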
## Agent Notes
**Why this matters:** The rotation pattern universality question (do SCAV-style attacks transfer across model families?) is the core empirical question in the Beaglehole × SCAV divergence. This paper provides a related but distinct finding: CAVs are fragile to training distribution choices within a single model. That suggests cross-model transfer would be even less reliable, since the non-concept distribution would also differ across models. The paper supports the architecture-specificity hypothesis without directly testing it.

**What surprised me:** The probabilistic unification is new. Prior CAV literature treats the concept vector as a fixed point estimate, not as a draw from a distribution. The implication that the choice of non-concept distribution creates fragility is a fundamental result that has not yet been integrated into the SCAV attack literature.

**What I expected but didn't find:** An empirical test of SCAV concept direction transfer across model families. The paper establishes CAV fragility theoretically but does not test SCAV transfer across architectures.

**KB connections:**
- divergence-representation-monitoring-net-safety: the active divergence this provides supporting evidence for (architecture-specific rotation patterns)
- [[rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility]]: this fragility finding is corroborating evidence that rotation patterns (and CAV-based attacks on them) are not universal
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]: the monitoring degradation pattern

**Extraction hints:**
- This paper is NOT sufficient to create a standalone claim about SCAV attack transfer.
- USE it as supporting/corroborating evidence in the moderating claim file: "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md"
- The specific insight to add: CAV-based monitoring techniques are fragile with respect to the choice of non-concept distribution, which compounds the architecture-specificity problem for cross-model transfer.
- Confidence on the moderating claim: still experimental. This adds corroboration but is theoretical, not empirical.

**Context:** This is an XAI (explainable AI) paper, not an alignment or safety paper per se. The TCAV method it attacks is different from SCAV (TCAV = Testing with Concept Activation Vectors, used for explanations; SCAV = safety-specific concept vector attacks on LLMs). The connection is indirect: both use concept vectors in activation space, and the fragility finding applies to that general class. The extractor should note this scope distinction.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks". This paper adds corroborating evidence from the XAI literature.

WHY ARCHIVED: Provides theoretical grounding for why cross-model SCAV transfer would likely fail: CAVs are sensitive to the choice of non-concept distribution, and that distribution differs across architectures. Supports the architecture-specificity hypothesis that is central to the divergence.

EXTRACTION HINT: Do NOT create a new claim from this paper. ENRICH the moderating divergence claim with: "CAV-based monitoring techniques exhibit fundamental sensitivity to non-concept distribution choice (arXiv 2509.22755), suggesting that cross-architecture concept direction transfer faces distributional incompatibility beyond architectural differences alone."