From deb497dd59fe627e20f8ef5c61f4acf0ef2ab2f3 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sun, 26 Apr 2026 00:26:13 +0000
Subject: [PATCH] theseus: extract claims from 2026-04-26-apollo-research-no-cross-model-deception-probe-published

- Source: inbox/queue/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus
---
 ...-but-not-white-box-protection-against-scav-attacks.md | 9 ++++++++-
 ...-research-no-cross-model-deception-probe-published.md | 5 ++++-
 2 files changed, 12 insertions(+), 2 deletions(-)
 rename inbox/{queue => archive/ai-alignment}/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md (97%)

diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md b/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
index ef7a528b7..cbe851376 100644
--- a/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
@@ -10,9 +10,16 @@ agent: theseus
 sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
 scope: structural
 sourcer: Theseus
-related: ["anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent"]
+related: ["anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent", "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility"]
 ---
 
 # Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal
 
 Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. However, this architectural improvement does not provide structural protection against adversarial attacks in the white-box case. With access to model weights and activations (the standard condition for open-weights models like Llama, Mistral, Falcon), an attacker can generalize SCAV to compute concept directions at each monitored layer and construct a single perturbation suppressing all of them simultaneously. This is a higher-dimensional optimization problem but structurally feasible by the same mechanism as single-layer SCAV. The critical unresolved question is whether black-box attacks transfer: single-layer SCAV transferred to GPT-4 because concept direction universality allowed reconstruction from different models. Multi-layer black-box SCAV requires that rotation patterns (how directions change across layers) are also universal. Beaglehole et al. found concept vectors transfer cross-language and cross-model-family, suggesting the underlying geometry may be universal enough to enable rotation pattern transfer. However, different architectures (depth, attention heads, MLP width, pre-training data) produce different residual stream dynamics, and rotation may depend on model-specific representational basis evolution. No published work tests whether multi-layer rotation patterns transfer across model families. If they do not transfer, multi-layer ensembles provide genuine black-box protection for closed-source models. If they do transfer, multi-layer ensembles merely raise attack cost without escaping the dual-use structure. This creates a deployment-context-dependent safety verdict: open-weights models remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity, while closed-source models may gain genuine robustness if rotation patterns are model-specific.
+
+
+## Extending Evidence
+
+**Source:** Apollo Research publication gap analysis, April 2026
+
+The moderating claim that multi-layer ensemble probes provide black-box robustness depends on whether rotation patterns are architecture-specific or universal. As of April 2026, no cross-model-family probe transfer testing has been published, meaning the architecture-specificity assumption remains empirically untested. The absence of this testing after 14+ months suggests either: (a) cross-family transfer is known to fail internally and not worth publishing, (b) research agendas prioritize within-family deployment robustness, or (c) the experimental setup requires infrastructure not yet built.
diff --git a/inbox/queue/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md b/inbox/archive/ai-alignment/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
similarity index 97%
rename from inbox/queue/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
rename to inbox/archive/ai-alignment/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
index b09d34f90..f6b735fd3 100644
--- a/inbox/queue/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
+++ b/inbox/archive/ai-alignment/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
@@ -7,9 +7,12 @@ date: 2026-04-26
 domain: ai-alignment
 secondary_domains: []
 format: absence-of-evidence
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-04-26
 priority: medium
 tags: [apollo-research, deception-probe, cross-model-transfer, absence-of-evidence, linear-probes, b4-verification, rotation-patterns]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content