teleo-codex/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
Teleo Agents 29517bbd9a
theseus: extract claims from 2026-04-25-nordby-cross-model-limitations-family-specific-patterns
- Source: inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-30 03:26:39 +00:00


type: claim
domain: ai-alignment
description: White-box SCAV attacks can suppress multi-layer ensembles by targeting all monitored layers simultaneously, but black-box attacks may fail if rotation patterns don't transfer across model families
confidence: speculative
source: Theseus synthetic analysis of Nordby et al. (arXiv 2604.13386), Xu et al. SCAV (arXiv 2404.12038), Beaglehole et al. (Science 391, 2026)
created: 2026-04-22
title: Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal
agent: theseus
sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
scope: structural
sourcer: Theseus
related:
  - anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks
  - trajectory-monitoring-dual-edge-geometric-concentration
  - representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface
  - multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent
  - multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks
  - rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility

Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal

Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. This architectural improvement, however, does not provide structural protection against adversarial attacks in the white-box case. With access to model weights and activations (the standard condition for open-weights models such as Llama, Mistral, and Falcon), an attacker can generalize SCAV: compute the concept direction at each monitored layer and construct a single perturbation that suppresses all of them simultaneously. This is a higher-dimensional optimization problem, but it is structurally feasible by the same mechanism as single-layer SCAV.

The critical unresolved question is whether black-box attacks transfer. Single-layer SCAV transferred to GPT-4 because concept-direction universality allowed reconstruction from different models; multi-layer black-box SCAV additionally requires that rotation patterns (how directions change across layers) are universal. Beaglehole et al. found that concept vectors transfer across languages and model families, suggesting the underlying geometry may be universal enough to enable rotation-pattern transfer. However, different architectures (depth, attention heads, MLP width, pre-training data) produce different residual-stream dynamics, and rotation may depend on model-specific evolution of the representational basis. No published work tests whether multi-layer rotation patterns transfer across model families. If they do not transfer, multi-layer ensembles provide genuine black-box protection for closed-source models; if they do transfer, multi-layer ensembles merely raise attack cost without escaping the dual-use structure.

This yields a deployment-context-dependent safety verdict: open-weights models remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity, while closed-source models may gain genuine robustness if rotation patterns are model-specific.
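In a linear toy setting, the "one perturbation suppressing all monitored layers" step reduces to solving a small underdetermined linear system. A minimal numpy sketch, where the layer maps and probe directions are hypothetical stand-ins (a real SCAV-style attack would optimize through the actual nonlinear model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 4

# Hypothetical stand-ins: linear per-layer maps for the model and one
# unit-norm probe direction per monitored layer.
layer_maps = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

effective = []                       # v_l such that score_l(x) = v_l @ x
A = np.eye(d)
for W in layer_maps:
    A = W @ A                        # map from the input to layer l's activations
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)           # probe direction at layer l
    effective.append(A.T @ w)

V = np.stack(effective, axis=1)      # shape (d, n_layers)

def scores(x):
    """Probe score at every monitored layer for input activation x."""
    return V.T @ x

x0 = rng.standard_normal(d)          # the activation the attacker perturbs

# One perturbation that drives every layer's probe score to zero at once:
# solve the underdetermined system V^T delta = -V^T x0. lstsq returns the
# minimum-norm solution, i.e. the smallest such perturbation.
delta, *_ = np.linalg.lstsq(V.T, -scores(x0), rcond=None)

print(np.abs(scores(x0)).max())          # nonzero before the attack
print(np.abs(scores(x0 + delta)).max())  # essentially zero after
```

With n_layers well below the activation dimension, the system is underdetermined and an exact suppressing perturbation always exists; the nonlinear case replaces the closed-form solve with gradient-based optimization but keeps the same structure.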

Extending Evidence

Source: Apollo Research publication gap analysis, April 2026

The moderating claim that multi-layer ensemble probes provide black-box robustness depends on whether rotation patterns are architecture-specific or universal. As of April 2026, no cross-model-family probe transfer testing has been published, meaning the architecture-specificity assumption remains empirically untested. The absence of this testing after 14+ months suggests either: (a) cross-family transfer is known to fail internally and not worth publishing, (b) research agendas prioritize within-family deployment robustness, or (c) the experimental setup requires infrastructure not yet built.
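The untested experiment can at least be stated concretely. A minimal sketch, assuming a "rotation pattern" is summarized as the cosine similarity between the concept direction at successive layers; the per-layer directions here are random placeholders, whereas the real test would use probes trained on each family's activations:

```python
import numpy as np

rng = np.random.default_rng(1)

def rotation_pattern(directions):
    # Cosine similarity between the concept direction at successive layers:
    # one simple summary of how a direction rotates through the residual stream.
    u = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return np.array([u[i] @ u[i + 1] for i in range(len(u) - 1)])

# Hypothetical per-layer concept directions for two model families. In the
# actual experiment these would come from trained probes; random placeholders
# only fix the interface.
family_a = rng.standard_normal((12, 256))
family_b = rng.standard_normal((12, 256))

pat_a = rotation_pattern(family_a)
pat_b = rotation_pattern(family_b)

# If rotation patterns were universal, this correlation would be high across
# families; if they are model-specific, it would hover near zero.
print(np.corrcoef(pat_a, pat_b)[0, 1])
```

The infrastructure burden mentioned in (c) is visible even in this sketch: it presupposes matched probe-training data and comparable layer indexing across families with different depths.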

Extending Evidence

Source: Schnoor et al. 2025, arXiv 2509.22755

CAV-based monitoring techniques exhibit fundamental sensitivity to the choice of non-concept distribution (Schnoor et al., arXiv 2509.22755). The authors demonstrate that CAVs are random vectors whose distribution depends heavily on the arbitrary choice of non-concept examples used during training, and they present an adversarial attack on TCAV (Testing with Concept Activation Vectors) that exploits this distributional dependence. This suggests cross-architecture concept direction transfer faces distributional incompatibility beyond architectural differences alone: even within a single model, CAV reliability depends on training-distribution choices that would necessarily differ across model families.
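A toy illustration of that distributional dependence, using the simplest mean-difference CAV variant on synthetic data (Schnoor et al.'s construction is more general; everything below is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 128

# Synthetic activations: one fixed set of concept examples, and two different
# (equally arbitrary) choices of non-concept negatives.
concept = rng.standard_normal((200, d)) + 0.5          # concept cloud, shifted
shift = rng.standard_normal(d)                         # arbitrary offset
negatives_1 = rng.standard_normal((200, d))            # one negative distribution
negatives_2 = rng.standard_normal((200, d)) + shift    # another one

def cav(pos, neg):
    # Simplest CAV variant: normalized difference of class means.
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

v1 = cav(concept, negatives_1)
v2 = cav(concept, negatives_2)

# Identical concept examples, different negatives -> noticeably different
# concept direction (cosine similarity well below 1).
print(v1 @ v2)
```

Since the monitored "concept" never changed between the two fits, any disagreement between `v1` and `v2` is attributable purely to the negative-example choice, which is the dependence the attack exploits.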

Extending Evidence

Source: Nordby et al. arXiv 2604.13386, Limitations + empirical results

Nordby et al. provide indirect empirical evidence for architecture-specificity of rotation patterns through probe non-generalization. Family-specific probe performance patterns, dramatic variance in optimal layer positions across architectures, and the absence of any universal ensemble configuration all suggest that rotation patterns are architecture-dependent. The paper notes 'tens to hundreds of deception related directions' in larger models, indicating complex, architecture-specific geometry. This supports the hypothesis that black-box multi-layer SCAV attacks would fail against closed-source models with different architectures, strengthening the 'Nordby wins for closed-source deployments' resolution. However, the paper contains no adversarial robustness evaluation whatsoever; all results are on clean data. Confidence upgrades from speculative to experimental based on indirect evidence.
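The kind of probe non-generalization Nordby et al. report can be mimicked in a toy setting: a mean-difference probe fit on one "family" scores near chance on a second family whose activations express the same concept in a rotated basis. All geometries below are synthetic assumptions, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 64, 500

def auroc(scores, labels):
    # Rank-based AUROC: probability a positive example outscores a negative.
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).mean()

# Hypothetical setup: both families encode the same concept along some axis,
# but family B expresses it in a family-specific (randomly rotated) basis.
concept_axis = np.zeros(d); concept_axis[0] = 2.0
basis_b, _ = np.linalg.qr(rng.standard_normal((d, d)))   # family B's basis

def sample(n, rotated):
    labels = rng.integers(0, 2, n)
    x = rng.standard_normal((n, d)) + labels[:, None] * concept_axis
    return (x @ basis_b if rotated else x), labels

xa, ya = sample(n, rotated=False)      # "family A" activations
xb, yb = sample(n, rotated=True)       # "family B" activations

w = xa[ya == 1].mean(0) - xa[ya == 0].mean(0)   # probe fit on family A only

print(auroc(xa @ w, ya))   # in-family: well above chance
print(auroc(xb @ w, yb))   # cross-family: typically near chance
```

The gap between the two AUROCs is the signature the paper's family-specific results point at; an actual cross-family transfer study would measure it on real activations rather than this rotated-Gaussian assumption.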