teleo-codex/inbox/queue/2026-04-20-theseus-beaglehole-scav-divergence-formal-proposal.md
theseus: research session 2026-04-20 — 4 sources archived
Pentagon-Agent: Theseus <HEADLESS>
2026-04-20 00:10:57 +00:00


---
type: source
title: "Representation Monitoring Improves Detection AND Creates Attack Surface: The Beaglehole × SCAV Divergence"
author: "Theseus (synthetic analysis — formal divergence proposal)"
url: null
date: 2026-04-20
domain: ai-alignment
secondary_domains: []
format: synthetic-analysis
status: unprocessed
priority: high
tags: [representation-monitoring, dual-use, linear-concept-vectors, interpretability, adversarial-robustness, divergence-candidate, b4-verification]
---
## Content
### The Source Papers
**Beaglehole et al. (Science 2026):** "Toward Universal Steering and Monitoring of AI Models" (arXiv 2502.03708)
- Demonstrates that linear concept representation monitoring outperforms judge-based behavioral monitoring for detecting misaligned content (hallucinations, toxic content)
- Larger models are MORE steerable — the approach scales favorably with capability
- Concept vectors are "universal": transfer across languages, model types, and families
- Published in Science 391(6787) — major venue signal for mainstream adoption
- Does NOT engage with adversarial implications of concept vector identification
**SCAV — Xu et al. (NeurIPS 2024):** "Uncovering Safety Risks of Large Language Models through Concept Activation Vector" (arXiv 2404.12038)
- 99.14% average attack success rate on seven open-source LLMs using concept activation vectors
- Uses EXACTLY the same technical primitive as Beaglehole: identifies the linear direction in activation space corresponding to safety concepts
- The SCAV concept vector is the same monitoring vector Beaglehole identifies; the attack consists of suppressing that vector
- Attacks transfer to GPT-4 (black-box) and other models
- Published 18 months BEFORE Beaglehole's Science paper; not cited in Beaglehole
### The Structural Tension
These papers are not in contradiction — they are both empirically correct about different aspects of the same technical approach. The divergence is about IMPLICATIONS for deployment:
**Beaglehole's implicit claim:** Representation monitoring is an improvement over behavioral monitoring. Deploying concept vector monitoring improves safety.
**SCAV's implicit counter:** Deploying concept vector monitoring in production creates a targeting map. The same algorithm that finds the safety direction for monitoring purposes also provides the adversarial attack gradient. Organizations that publish their monitoring approach (or whose approach is reverse-engineered) simultaneously publish the jailbreak recipe for 99.14% success.
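The shared primitive can be sketched concretely. The toy below uses synthetic Gaussian activations and a difference-of-means concept vector; neither paper's actual pipeline is this simple (Beaglehole trains probes, SCAV derives vectors from a classifier), but the duality is the same: the projection used for monitoring is also the direction ablated by the attack.

```python
# Toy sketch of the monitoring-attack duality. All data is synthetic;
# this illustrates the shared linear primitive, not either paper's method.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension

# Synthetic activations: "unsafe" examples are shifted along a hidden direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
safe = rng.normal(size=(200, d))
unsafe = rng.normal(size=(200, d)) + 3.0 * true_dir

# Monitoring: estimate the concept vector as the difference of class means.
concept = unsafe.mean(axis=0) - safe.mean(axis=0)
concept /= np.linalg.norm(concept)

def monitor_score(h):
    """Projection onto the concept direction; high score -> flag as unsafe."""
    return h @ concept

threshold = 0.5 * (monitor_score(safe).mean() + monitor_score(unsafe).mean())
detected = (monitor_score(unsafe) > threshold).mean()

# Attack: the SAME vector gives the suppression direction. Ablating the
# component along `concept` drives every score to ~0, below the threshold.
suppressed = unsafe - np.outer(monitor_score(unsafe), concept)
evaded = (monitor_score(suppressed) <= threshold).mean()

print(f"detection rate before attack: {detected:.2f}")
print(f"evasion rate after ablation:  {evaded:.2f}")
```

The point of the sketch: the attacker needs nothing beyond what the monitor itself computes. Publishing (or leaking) `concept` is publishing the evasion recipe.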
**The scope of disagreement:**
- Against naive, uninformed adversaries: Beaglehole is right. Monitoring outperforms judge-based approaches.
- Against adversarially-informed attackers (who know the monitoring approach): SCAV is right. The monitoring vector becomes the attack vector.
- Against future larger models: SCAV's concern intensifies — Beaglehole finds larger models are MORE steerable, which means they are MORE vulnerable to SCAV-style attacks.
### Why This Is a Real Divergence (Not a Scope Mismatch)
The CLAUDE.md warns: "~85% of apparent tensions are scope mismatches — fix the scope first."
This is NOT a scope mismatch. Both papers operate on the same question: "Does linear concept representation monitoring improve AI safety?" They disagree on the answer because they study different adversarial contexts:
- Beaglehole studies the monitoring-vs-no-monitoring comparison (vs. judge model baseline)
- SCAV studies the monitoring-as-attack-surface comparison (vs. no-monitoring baseline)
The deployer cannot know which adversarial context they are in. A security-conscious organization deploying Beaglehole-style monitoring may be:
- Improving safety against unsophisticated attackers (Beaglehole is right for them)
- OR creating a targeting map for sophisticated attackers (SCAV is right for them)
The critical governance implication: Beaglehole-style monitoring may be the right choice for low-stakes AI deployment and the wrong choice for high-stakes AI deployment — not because the monitoring doesn't work, but because the adversarial sophistication threshold matters.
### The Silo Failure
Beaglehole was published 18 months after SCAV (NeurIPS 2024, September 2024; Science 2026, February 2026). The authors did not engage with SCAV. This suggests the interpretability-for-safety community and the adversarial robustness community have NOT integrated their findings, despite the technical primitives being identical.
This is a documented community silo failure. The KB should reflect it explicitly as a divergence rather than treating one of these papers as simply "wrong."
### Draft Divergence File
The following divergence is ready for formal filing in `domains/ai-alignment/`:
---
```
---
type: divergence
title: "Does deploying representation monitoring improve or degrade safety in adversarially-informed contexts?"
domain: ai-alignment
description: "Representation monitoring outperforms judge-based behavioral monitoring (Beaglehole, Science 2026) but the monitoring vector becomes a 99.14%-success attack vector when known to adversaries (SCAV, NeurIPS 2024) — the question is whether the net safety effect is positive or negative in high-stakes adversarially-informed deployment contexts"
status: open
claims:
- "linear concept representation monitoring outperforms judge-based behavioral monitoring.md" [TO BE FILED from Beaglehole]
- "linear concept monitoring creates an adversarial attack surface achieving 99% jailbreak success.md" [TO BE FILED from SCAV]
surfaced_by: theseus
created: 2026-04-20
---
# Does deploying representation monitoring improve or degrade safety in adversarially-informed contexts?
Deploying representation monitoring involves identifying the linear direction in activation space that the model uses to encode safety concepts, then monitoring activations relative to this direction. This is more accurate than judge-based behavioral monitoring (Beaglehole et al., Science 2026: monitoring outperforms judges for misaligned content detection). But the same identification process creates an adversarial targeting map: knowing the safety concept direction enables suppression attacks that achieve 99.14% jailbreak success (SCAV, NeurIPS 2024). The deployer cannot know a priori which adversarial context they are in.
## Divergent Claims
### Representation monitoring outperforms behavioral monitoring
**File:** [[linear concept representation monitoring outperforms judge-based behavioral monitoring]]
**Core argument:** Linear concept vectors extracted from model activations detect misaligned content more accurately than judge models, scale favorably with model capability, and transfer across languages and model families — making representation monitoring the preferred approach for identifying hallucinations and toxic content.
**Strongest evidence:** Science 391(6787) venue + quantitative monitoring advantage over judge baseline in multiple content categories
### Representation monitoring creates its own attack surface
**File:** [[linear concept monitoring creates an adversarial attack surface achieving 99% jailbreak success]]
**Core argument:** The linear direction in activation space that serves as the safety concept vector (used for monitoring) also serves as the adversarial attack vector — suppressing this direction via gradient-based perturbation achieves 99.14% attack success. Larger models are more steerable, intensifying this vulnerability.
**Strongest evidence:** NeurIPS 2024 publication; closed-form adversarial perturbation solution; GPT-4 black-box transfer
## What Would Resolve This
1. **Deployment context measurement**: Evidence characterizing the adversarial sophistication distribution of real-world attackers targeting Beaglehole-style monitoring deployments. If >X% of real-world attackers are sophisticated enough to mount SCAV-style attacks, the net safety effect is negative; if <X%, positive.
2. **Private monitoring architecture**: Evidence that representation monitoring can be deployed WITHOUT revealing the safety concept vector — e.g., through hardware-enforced monitoring where the concept vector is not accessible to external parties (neither attackers NOR the model being monitored). This would resolve the divergence in favor of monitoring.
3. **Adversarial scaling curve**: Evidence about whether the SCAV attack success rate remains at 99.14% for frontier models or degrades. Beaglehole finds larger models are MORE steerable; does SCAV's attack success scale accordingly?
4. **Post-deployment incident data**: Evidence of SCAV-style attacks in production deployments of representation monitoring systems. Absence of documented attacks would suggest the adversarial sophistication threshold hasn't been reached; presence would confirm SCAV's concern is operationally relevant.
## Cascade Impact
If Beaglehole wins (monitoring improves safety net even in adversarial contexts):
- B4 (Verification degrades faster than capability grows) is partially weakened — representation monitoring provides a more durable verification approach than behavioral monitoring
- The constructive case for Level 2 monitoring as a safety architecture is strengthened
If SCAV wins (monitoring degrades safety net in adversarially-informed contexts):
- B4 is strengthened — even the best available monitoring approach creates its own attack surface
- The case for Level 3 (trajectory geometry) monitoring becomes urgent — trajectory monitoring doesn't expose an equally precise attack vector, since evasion would require perturbing the entire computation path
- The case for hardware-enforced private monitoring is strengthened — if concept vectors must be private to be safe, architectural enforcement is required
```
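Resolution path 2 in the draft above can be made concrete. The sketch below is hypothetical: `make_private_monitor` is an invented name, and a Python closure stands in for what would need to be a hardware-enforced boundary such as a TEE. The monitor exposes only a pass/fail bit, never the concept vector or the raw score.

```python
# Hypothetical sketch of "private monitoring architecture". The closure is a
# stand-in for a hardware-enforced boundary; no real enclave SDK is used.
import numpy as np

def make_private_monitor(concept_vector, threshold):
    """Return a check() that reveals only a pass/fail bit per query.

    The concept vector and the raw projection score never leave the closure,
    so a caller cannot read off the SCAV suppression direction directly.
    """
    v = concept_vector / np.linalg.norm(concept_vector)

    def check(activation):
        # One bit out; the vector and score stay private.
        return bool(activation @ v > threshold)

    return check

rng = np.random.default_rng(1)
v_secret = rng.normal(size=64)
v_unit = v_secret / np.linalg.norm(v_secret)  # an attacker would NOT have this

check = make_private_monitor(v_secret, threshold=2.0)
flagged = check(5.0 * v_unit)   # activation far along the concept direction
cleared = check(np.zeros(64))   # neutral activation
```

One caveat worth noting: even a one-bit oracle leaks information, and repeated queries enable boundary-finding attacks that approximate the hidden direction. Keeping the vector private raises the attack cost; it does not eliminate the SCAV-style threat.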
---
### Extractor Instructions
This archive proposes both:
1. Two individual claims to file (Beaglehole monitoring advantage + SCAV attack surface)
2. One divergence file linking them
The divergence draft above is ready to file as-is in `domains/ai-alignment/divergence-representation-monitoring-vs-attack-surface.md` once the two underlying claims are extracted and filed.
## Agent Notes
**Why this matters:** Beaglehole's Science publication will drive adoption of concept vector monitoring by AI labs and regulators as a "state of the art" safety tool. If SCAV's implications aren't integrated, organizations will simultaneously improve detection against naive attackers and create attack infrastructure for sophisticated ones. This is a safety regression for organizations facing adversarially-sophisticated threat models.
**What surprised me:** The 18-month gap between SCAV (NeurIPS 2024) and Beaglehole (Science 2026) without integration. This is a large community silo failure for a field that claims to be coordination-focused.
**What I expected but didn't find:** Any paper between SCAV and Beaglehole that explicitly examines the dual-use implications of linear concept vector monitoring. The gap is real — no one has formally characterized the monitoring-attack duality at the linear concept level.
**KB connections:**
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — both papers complicate this: oversight via concept monitoring outperforms debate but creates new attack surface
- B4 (verification degrades) — the Beaglehole×SCAV divergence IS an instance of B4: as monitoring capability improves, the attack capability improves proportionally
- [[the alignment tax creates a structural race to the bottom]] — sophisticated attackers exploiting monitoring vectors represent a different form of the same dynamic: investment in safety creates new attack surface for those who decline to invest
**Extraction hints:**
- Priority 1: File divergence file `divergence-representation-monitoring-vs-attack-surface.md`
- Priority 2: Extract Beaglehole claim (monitoring outperforms behavioral/judge approaches)
- Priority 3: Extract SCAV claim (monitoring vector = attack vector, 99.14% success)
- The two individual claims are in `inbox/archive/2026-02-23-beaglehole-universal-steering-monitoring-ai-models.md` and `inbox/archive/2024-09-22-chen-scav-concept-activation-vector-attack.md` — both still `status: unprocessed`
- IMPORTANT: Mark both source archives as `status: processing` when filing the divergence + claims
**Context:** Divergence surfaced across Sessions 26-29 as the monitoring precision hierarchy analysis accumulated. The Session 28 musing explicitly noted "DIVERGENCE CANDIDATE" and Session 29 said "ready for formal divergence proposal." This session completes the handoff to extraction.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow]] — this divergence adds a specific mechanism at Level 2 monitoring
WHY ARCHIVED: The Beaglehole × SCAV tension is the most important unresolved divergence in the current AI alignment monitoring literature. It has real governance implications if representation monitoring gets adopted at scale.
EXTRACTION HINT: File the divergence first (it creates the KB structure), then file the two underlying claims. The divergence is more valuable than either claim individually — it connects two communities (interpretability and adversarial robustness) that haven't yet integrated their findings.