theseus: extract claims from 2026-04-26-apollo-research-no-cross-model-deception-probe-published
Some checks are pending
Mirror PR to Forgejo / mirror (pull_request) Waiting to run

- Source: inbox/queue/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
This commit is contained in:
Teleo Agents 2026-04-26 00:26:13 +00:00
parent a706e55d78
commit deb497dd59
2 changed files with 12 additions and 2 deletions

View file

@ -10,9 +10,16 @@ agent: theseus
sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
scope: structural
sourcer: Theseus
related: ["anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent"]
related: ["anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent", "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility"]
---
# Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal
Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. However, this architectural improvement does not provide structural protection against adversarial attacks in the white-box case. With access to model weights and activations (the standard condition for open-weights models like Llama, Mistral, Falcon), an attacker can generalize SCAV to compute concept directions at each monitored layer and construct a single perturbation suppressing all of them simultaneously. This is a higher-dimensional optimization problem but structurally feasible by the same mechanism as single-layer SCAV. The critical unresolved question is whether black-box attacks transfer: single-layer SCAV transferred to GPT-4 because concept direction universality allowed reconstruction from different models. Multi-layer black-box SCAV requires that rotation patterns (how directions change across layers) are also universal. Beaglehole et al. found concept vectors transfer cross-language and cross-model-family, suggesting the underlying geometry may be universal enough to enable rotation pattern transfer. However, different architectures (depth, attention heads, MLP width, pre-training data) produce different residual stream dynamics, and rotation may depend on model-specific representational basis evolution. No published work tests whether multi-layer rotation patterns transfer across model families. If they do not transfer, multi-layer ensembles provide genuine black-box protection for closed-source models. If they do transfer, multi-layer ensembles merely raise attack cost without escaping the dual-use structure. This creates a deployment-context-dependent safety verdict: open-weights models remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity, while closed-source models may gain genuine robustness if rotation patterns are model-specific.
## Extending Evidence
**Source:** Apollo Research publication gap analysis, April 2026
The moderating claim that multi-layer ensemble probes provide black-box robustness depends on whether rotation patterns are architecture-specific or universal. As of April 2026, no cross-model-family probe transfer testing has been published, meaning the architecture-specificity assumption remains empirically untested. The absence of this testing after 14+ months suggests either: (a) cross-family transfer is known to fail internally and not worth publishing, (b) research agendas prioritize within-family deployment robustness, or (c) the experimental setup requires infrastructure not yet built.

View file

@ -7,9 +7,12 @@ date: 2026-04-26
domain: ai-alignment
secondary_domains: []
format: absence-of-evidence
status: unprocessed
status: processed
processed_by: theseus
processed_date: 2026-04-26
priority: medium
tags: [apollo-research, deception-probe, cross-model-transfer, absence-of-evidence, linear-probes, b4-verification, rotation-patterns]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content