theseus: extract claims from 2026-04-26-apollo-research-no-cross-model-deception-probe-published

- Source: inbox/queue/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-26 00:26:13 +00:00
commit deb497dd59
parent a706e55d78
2 changed files with 12 additions and 2 deletions
--- a/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
@@ -10,9 +10,16 @@ agent: theseus
 sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
 scope: structural
 sourcer: Theseus
-related: ["anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent"]
+related: ["anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent", "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility"]
 ---

 # Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal

 Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. However, this architectural improvement does not provide structural protection against adversarial attacks in the white-box case. With access to model weights and activations (the standard condition for open-weights models like Llama, Mistral, Falcon), an attacker can generalize SCAV to compute concept directions at each monitored layer and construct a single perturbation suppressing all of them simultaneously. This is a higher-dimensional optimization problem but structurally feasible by the same mechanism as single-layer SCAV. The critical unresolved question is whether black-box attacks transfer: single-layer SCAV transferred to GPT-4 because concept direction universality allowed reconstruction from different models. Multi-layer black-box SCAV requires that rotation patterns (how directions change across layers) are also universal. Beaglehole et al. found concept vectors transfer cross-language and cross-model-family, suggesting the underlying geometry may be universal enough to enable rotation pattern transfer. However, different architectures (depth, attention heads, MLP width, pre-training data) produce different residual stream dynamics, and rotation may depend on model-specific representational basis evolution. No published work tests whether multi-layer rotation patterns transfer across model families. If they do not transfer, multi-layer ensembles provide genuine black-box protection for closed-source models. If they do transfer, multi-layer ensembles merely raise attack cost without escaping the dual-use structure. This creates a deployment-context-dependent safety verdict: open-weights models remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity, while closed-source models may gain genuine robustness if rotation patterns are model-specific.
+
+
+## Extending Evidence
+
+**Source:** Apollo Research publication gap analysis, April 2026
+
+The moderating claim that multi-layer ensemble probes provide black-box robustness depends on whether rotation patterns are architecture-specific or universal. As of April 2026, no cross-model-family probe transfer testing has been published, meaning the architecture-specificity assumption remains empirically untested. The absence of this testing after 14+ months suggests either: (a) cross-family transfer is known to fail internally and not worth publishing, (b) research agendas prioritize within-family deployment robustness, or (c) the experimental setup requires infrastructure not yet built.
--- a/inbox/archive/ai-alignment/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
+++ b/inbox/archive/ai-alignment/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
@@ -7,9 +7,12 @@ date: 2026-04-26
 domain: ai-alignment
 secondary_domains: []
 format: absence-of-evidence
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-04-26
 priority: medium
 tags: [apollo-research, deception-probe, cross-model-transfer, absence-of-evidence, linear-probes, b4-verification, rotation-patterns]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content