teleo-codex/inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md at 08a055016e8a54a1d76a596f3fa0c683bc0943f4

Theseus 88125348eb theseus: research session 2026-04-22 — 2 sources archived

Pentagon-Agent: Theseus <HEADLESS>

2026-04-22 01:47:49 +00:00



Content

Sources synthesized: Nordby et al. (arXiv 2604.13386, April 2026) + Xu et al. SCAV (arXiv 2404.12038, NeurIPS 2024) + Beaglehole et al. (Science 391, 2026)

This synthetic analysis addresses the key open question from Sessions 29-30: does the multi-layer ensemble approach from Nordby et al. (which improves AUROC 29-78% over single-layer probes) resolve the SCAV dual-use vulnerability, or does it merely shift the attack surface?

The SCAV Attack Mechanism

SCAV identifies the linear concept direction in a given activation layer that separates safe/unsafe responses, then constructs adversarial perturbations to suppress that direction. Single-layer SCAV achieves 99.14% jailbreak success and transfers black-box to GPT-4 because the linear safety representation is universal across model families.
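A minimal sketch of the single-layer mechanism described above, on synthetic activations. It uses a difference-of-means direction as a stand-in for SCAV's own probe, and every name here (`fit_concept_direction`, `suppress_direction`, the toy data) is an illustrative assumption, not the paper's code. It shows only the geometric idea: once the separating direction is known, shifting an activation along it is enough to move the activation to the other side of a linear monitor.

```python
import numpy as np

def fit_concept_direction(safe_acts, unsafe_acts):
    """Unit-norm difference-of-means direction pointing from safe to unsafe."""
    direction = unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def suppress_direction(activation, direction, target_projection):
    """Shift one activation along the direction until its projection hits the target."""
    current = activation @ direction
    return activation + (target_projection - current) * direction

# Synthetic activations (assumption): the two classes differ along one axis.
rng = np.random.default_rng(0)
dim = 16
true_direction = np.zeros(dim)
true_direction[0] = 1.0
safe = rng.normal(size=(200, dim)) - 2.0 * true_direction
unsafe = rng.normal(size=(200, dim)) + 2.0 * true_direction

direction = fit_concept_direction(safe, unsafe)
safe_mean_projection = (safe @ direction).mean()
threshold = 0.5 * (safe_mean_projection + (unsafe @ direction).mean())

# Perturb every unsafe activation so it projects where safe activations do.
perturbed = np.array(
    [suppress_direction(a, direction, safe_mean_projection) for a in unsafe]
)
flagged_before = float(((unsafe @ direction) > threshold).mean())
flagged_after = float(((perturbed @ direction) > threshold).mean())
```

In this toy setup the linear monitor flags almost all unsafe activations before the shift and none after it. The same direction serves both the monitor and the attacker, which is the dual-use property at issue.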

Why Multi-Layer Ensembles Are Better on Clean Data

Nordby et al. show that the deception-relevant concept direction rotates across layers — it doesn't stay fixed as information passes through the residual stream. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure and achieving substantially higher AUROC.
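The rotation argument can be illustrated with a toy example. This is a sketch under stated assumptions, not a reproduction of Nordby et al.: the rotation is built into the synthetic data by construction, noise is independent across layers (which favors the ensemble more than real, correlated layers would), and all names are hypothetical.

```python
import numpy as np

def fit_direction(acts, labels):
    """Unit-norm difference-of-means direction for one layer."""
    diff = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return diff / np.linalg.norm(diff)

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

rng = np.random.default_rng(0)
n_layers, dim, n = 4, 8, 2000
labels = rng.integers(0, 2, size=n)
signs = 2 * labels - 1

# Assumption: the concept direction rotates by 30 degrees per layer in one plane.
angles = np.arange(n_layers) * np.pi / 6
true_dirs = np.zeros((n_layers, dim))
true_dirs[:, 0] = np.cos(angles)
true_dirs[:, 1] = np.sin(angles)
acts = signs[None, :, None] * true_dirs[:, None, :] + rng.normal(size=(n_layers, n, dim))

train, test = slice(0, n // 2), slice(n // 2, n)
dirs = np.stack([fit_direction(acts[l, train], labels[train]) for l in range(n_layers)])

# How far each layer's fitted direction has rotated away from layer 0.
cos_to_first = dirs @ dirs[0]

layer_scores = np.stack([acts[l, test] @ dirs[l] for l in range(n_layers)])
single_auroc = auroc(layer_scores[0], labels[test])                # one layer only
stale_auroc = auroc(acts[-1, test] @ dirs[0], labels[test])        # layer-0 probe reused at last layer
ensemble_auroc = auroc(layer_scores.mean(axis=0), labels[test])    # per-layer probes averaged
```

Here `cos_to_first` falls toward zero by the last layer, so a probe fitted at layer 0 carries almost no signal when reused there, while averaging per-layer probes scores above any single layer.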

White-Box SCAV Against Multi-Layer Ensembles

With white-box access (model weights and activations), a SCAV attack can be generalized to multi-layer ensembles by computing the concept direction at each monitored layer and constructing a single perturbation that suppresses all of them simultaneously. This is a higher-dimensional optimization problem requiring more computation and data, but it is structurally feasible by the same mechanism. No architectural escape exists for white-box adversaries: multi-layer SCAV is just single-layer SCAV applied to every monitored layer at once.

Open-weights models (Llama, Mistral, Falcon, etc.) are therefore fully vulnerable to multi-layer SCAV attacks regardless of ensemble complexity.
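The structure of the white-box argument can be sketched in a toy linear model. Everything here is an assumption made for illustration: each monitored layer is a fixed linear map of the input, the probe directions are random, and the names are hypothetical. A real transformer is nonlinear, so an actual attack would need iterative optimization rather than a closed-form solve. The sketch shows only that suppressing several linear probes at once amounts to a handful of linear constraints on one perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, n_layers = 32, 16, 4

# Hypothetical toy model: each monitored layer is a linear map of the input.
layer_maps = rng.normal(size=(n_layers, hidden_dim, input_dim)) / np.sqrt(input_dim)

# Hypothetical unit-norm probe direction per layer; score > threshold means "flagged".
probe_dirs = rng.normal(size=(n_layers, hidden_dim))
probe_dirs /= np.linalg.norm(probe_dirs, axis=1, keepdims=True)
threshold = 0.0

def probe_scores(x):
    """Score of every layer's probe on input x."""
    return np.array([probe_dirs[l] @ (layer_maps[l] @ x) for l in range(n_layers)])

# With white-box access, each score is a known linear function of the input:
# score_l = sensitivity[l] @ x.
sensitivity = np.einsum("lh,lhi->li", probe_dirs, layer_maps)

# Construct an input that all four probes flag (score 1.0 at every layer).
x = np.linalg.pinv(sensitivity) @ np.full(n_layers, 1.0)
scores_before = probe_scores(x)

# Minimum-norm single perturbation that moves every probe score to -1.0 at once.
target = np.full(n_layers, -1.0)
delta = np.linalg.pinv(sensitivity) @ (target - scores_before)
scores_after = probe_scores(x + delta)
```

Adding monitored layers adds rows to `sensitivity`, which raises the cost of the solve, but a solution exists whenever the perturbation has more free dimensions than there are probes.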

Black-Box SCAV Against Multi-Layer Ensembles

Single-layer SCAV transferred black-box to GPT-4 because concept direction universality allowed the attacker to reconstruct the target model's concept direction from a different model. The question is whether the rotation pattern (how the direction changes across layers) is also universal.

Argument for rotation universality: Beaglehole et al. found concept vectors transfer cross-language and cross-model-family. If the underlying geometry of safety representations is universal enough to enable cross-language transfer, the rotation pattern may also be universal.

Argument against rotation universality: Different model architectures and training setups (transformer depth, attention head count, MLP width, pre-training data) produce different residual stream dynamics. The concept direction at any single layer is a projection of a universal concept onto a model-specific representational basis. The rotation across layers depends on how that basis evolves, which may not be universal.

Key empirical question, currently unresolved: do multi-layer deception direction rotation patterns transfer across model families, making a black-box multi-layer SCAV attack feasible? No published paper tests this.
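One way the question could be made measurable is sketched below. It is an illustration, not a method from any of the cited papers. It assumes the "rotation pattern" is summarized as the cosine similarity between concept directions at consecutive monitored layers, and that both models are probed at the same number of layers. The function names and toy directions are hypothetical. The summary does not depend on each model's internal basis or hidden width, so it can be compared across model families.

```python
import numpy as np

def rotation_profile(layer_dirs):
    """Cosine similarity between each layer's direction and the next layer's."""
    unit = layer_dirs / np.linalg.norm(layer_dirs, axis=1, keepdims=True)
    return np.sum(unit[:-1] * unit[1:], axis=1)

def planar_directions(angles, dim, basis):
    """Toy directions rotating in one plane, expressed in a model-specific basis."""
    dirs = np.zeros((len(angles), dim))
    dirs[:, 0] = np.cos(angles)
    dirs[:, 1] = np.sin(angles)
    return dirs @ basis.T

rng = np.random.default_rng(0)
basis_a = np.linalg.qr(rng.normal(size=(8, 8)))[0]    # random orthogonal basis, "model A"
basis_b = np.linalg.qr(rng.normal(size=(12, 12)))[0]  # different width, "model B"

shared_angles = np.array([0.0, 0.3, 0.7, 1.2, 1.5])   # same rotation schedule
other_angles = np.array([0.0, 1.0, 1.1, 1.2, 2.5])    # different rotation schedule

profile_a = rotation_profile(planar_directions(shared_angles, 8, basis_a))
profile_b = rotation_profile(planar_directions(shared_angles, 12, basis_b))
profile_c = rotation_profile(planar_directions(other_angles, 12, basis_b))

gap_same = float(np.abs(profile_a - profile_b).max())  # same schedule, different basis
gap_diff = float(np.abs(profile_a - profile_c).max())  # different schedule
```

Matching profiles would be at most a necessary signature of universality. They would not by themselves show that a perturbation built on one model transfers to another.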

Provisional Conclusion

Open-weights models: Multi-layer ensembles improve clean monitoring accuracy substantially but provide no structural protection against adversarial attack. White-box multi-layer SCAV is feasible.
Closed-source models: Multi-layer ensembles may provide genuine black-box protection if rotation patterns are model-specific. This is an unverified assumption — but a testable one.

The monitoring precision hierarchy result (SCAV archive) holds: each level of the hierarchy is structurally defeatable given sufficient attacker capability. Multi-layer ensembles raise the attack cost but do not escape the dual-use structure.

Agent Notes

Why this matters: Nordby et al. is the strongest candidate for partial B4 disconfirmation — verification scaling with model size. This synthesis shows the disconfirmation is real but scoped: for closed-source models with black-box adversaries, verification may genuinely be improving. For open-weights models with white-box adversaries (the more common deployment context for adversarially-sophisticated attackers), B4 holds.

What surprised me: The rotation pattern universality question is a genuine gap — it hasn't been tested, and it's the pivot point between "multi-layer ensembles are adversarially robust" and "they aren't." This could be a high-value empirical question for the safety field.

What I expected but didn't find: Any empirical test of multi-layer ensemble probes under adversarial conditions. Nordby et al. is a clean-data accuracy paper — adversarial robustness is entirely absent from their scope. This is a gap in the field, not just the KB.

KB connections:

Extends: trajectory-monitoring-dual-edge-geometric-concentration.md — the dual-use finding now applies at all monitoring levels, including multi-layer ensembles
Qualifies: representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md — may need updating (trajectory geometry DOES create attack surface, just harder to exploit)
Directly relevant to: Beaglehole × SCAV divergence (representation monitoring improving vs. degrading net safety)
Challenges (partially): B4 belief — the closed-source, black-box regime may not face structural degradation

Extraction hints:

Primary claim: "Multi-layer ensemble linear probes improve monitoring robustness for closed-source models but provide no structural protection for open-weights models against white-box SCAV-generalization attacks." Confidence: speculative (rotation pattern universality untested).
Testable prediction: "If deception direction rotation patterns across layers are NOT universal across model families, multi-layer ensemble monitoring provides genuine black-box adversarial robustness that single-layer monitoring lacks."
Do NOT extract as confirming the monitoring precision hierarchy without the open-weights caveat.

Context: This synthesis was produced by Theseus as logical analysis from the existing archives. No new empirical sources were consulted. Treat as speculative synthesis pending empirical validation of the rotation universality question.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: trajectory-monitoring-dual-edge-geometric-concentration.md and Beaglehole × SCAV divergence

WHY ARCHIVED: Addresses the key open question about whether multi-layer probes escape the SCAV dual-use problem. Conclusion: partially (closed-source black-box case), not fully (open-weights white-box case). Produces a testable prediction about rotation pattern universality.

EXTRACTION HINT: Extract as a scope-qualified claim distinguishing open-weights (SCAV-vulnerable) from closed-source (may be more robust). Flag the testable prediction. Do not extract as confirming or denying B4 without the scope qualification.
