theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis
Some checks failed
Mirror PR to Forgejo / mirror (pull_request) Has been cancelled
- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
parent ff997b0087
commit f735da7ba3
3 changed files with 21 additions and 0 deletions
@ -23,3 +23,10 @@ Nordby et al. found that multi-layer ensemble probes improved AUROC by 29 percen
**Source:** Theseus synthesis of Nordby et al. (arXiv 2604.13386, April 2026)

The 29-78% AUROC improvement occurs because deception-relevant concept directions rotate across layers rather than remaining fixed in the residual stream. A single-layer probe captures only one projection of this rotation, which makes it brittle. Multi-layer ensembles combine projections from multiple layers and so capture more of the full rotational structure. However, this improvement is measured on clean data only; adversarial robustness testing is entirely absent from Nordby et al.'s scope.

## Extending Evidence

**Source:** Theseus synthesis of Nordby et al. (arXiv 2604.13386)

The performance improvement comes from capturing how deception-relevant concept directions rotate across layers, rather than assuming a single fixed direction. A single-layer probe captures only one projection of this rotation, which makes it brittle. Multi-layer ensembles combine projections from multiple layers and so capture more of the full rotational structure. This rotation is the proposed mechanism behind the AUROC improvement.
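
The mechanism described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from Nordby et al.: the activations are synthetic Gaussians, the class direction rotates by a fixed angle per layer, noise is independent across layers, and the probe is a simple difference-of-means classifier. The helper names (`make_layers`, `fit_probe`, `auroc`) are hypothetical. The sketch shows two things: a probe fitted at one layer transfers poorly to a layer where the direction has rotated, and averaging per-layer probe scores yields a higher AUROC than any single layer in this synthetic setup.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def make_layers(n, d, n_layers, signal, rng):
    """Synthetic activations: the class direction rotates in the (e0, e1) plane per layer."""
    y = rng.integers(0, 2, size=n)
    acts = []
    for layer in range(n_layers):
        theta = (np.pi / 2) * layer / max(n_layers - 1, 1)
        u = np.zeros(d)
        u[0], u[1] = np.cos(theta), np.sin(theta)
        # Class signal along u plus independent unit-variance noise (toy assumption).
        acts.append(signal * (2 * y - 1)[:, None] * u + rng.normal(size=(n, d)))
    return np.stack(acts), y  # acts has shape (n_layers, n, d)

def fit_probe(x, y):
    """Difference-of-means linear probe; returns (weight, bias)."""
    mu1, mu0 = x[y == 1].mean(axis=0), x[y == 0].mean(axis=0)
    w = mu1 - mu0
    b = -w @ (mu1 + mu0) / 2
    return w, b

rng = np.random.default_rng(0)
acts, y = make_layers(n=4000, d=16, n_layers=6, signal=0.5, rng=rng)
train, test = slice(0, 2000), slice(2000, 4000)

# One probe per layer, each scored on held-out activations from its own layer.
probes = [fit_probe(acts[l, train], y[train]) for l in range(len(acts))]
layer_scores = np.stack([acts[l, test] @ w + b for l, (w, b) in enumerate(probes)])

single_auroc = [auroc(s, y[test]) for s in layer_scores]
ensemble_auroc = auroc(layer_scores.mean(axis=0), y[test])

# The layer-0 probe applied to the last layer, where the direction has rotated by 90 degrees.
w0, b0 = probes[0]
transfer_auroc = auroc(acts[-1, test] @ w0 + b0, y[test])

print("per-layer AUROC:", np.round(single_auroc, 3))
print("ensemble AUROC:", round(float(ensemble_auroc), 3))
print("layer-0 probe on last layer:", round(float(transfer_auroc), 3))
```

Note that in this toy the ensemble's gain comes largely from averaging independent per-layer noise, and the rotation shows up as the failed transfer of a fixed single-layer probe. Real residual-stream activations are correlated across layers, so the size of any gain there is an empirical question.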
@ -23,3 +23,10 @@ Xu et al. introduce SCAV (Safety Concept Activation Vectors), which identifies
**Source:** Theseus synthetic analysis combining Nordby et al. and Xu et al. SCAV

Multi-layer ensemble probes do not escape the dual-use attack surface identified for single-layer probes. With white-box access, SCAV can be generalized to compute concept directions at each monitored layer and construct a single perturbation that suppresses all of them simultaneously. This is a higher-dimensional optimization problem that requires more computation and data, but it is structurally feasible by the same mechanism. Open-weights models (Llama, Mistral, Falcon) remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity.

## Extending Evidence

**Source:** Theseus synthesis of Nordby × SCAV (2026-04-22)

Multi-layer ensemble architectures do not escape the dual-use structure; they shift the attack surface to a higher-dimensional optimization problem. White-box multi-layer SCAV is structurally feasible by generalizing the single-layer attack to suppress concept directions at multiple layers simultaneously. No architectural escape exists against white-box adversaries. Black-box robustness depends on whether rotation patterns (how concept directions change across layers) are universal or model-specific, which is currently untested.
@ -24,3 +24,10 @@ Weight-space alignment geometry research (2602.15799) establishes that alignment
**Source:** Theseus synthesis addressing Sessions 29-30 open question

The dual-use finding now extends to multi-layer ensemble monitoring, with a deployment-context qualification: open-weights models face white-box multi-layer SCAV attacks that architectural improvements cannot prevent, while closed-source models may gain genuine black-box protection if concept direction rotation patterns are model-specific rather than universal. The monitoring precision hierarchy holds across all levels, but the severity of dual-use risk depends on whether attackers have white-box or black-box access.

## Extending Evidence

**Source:** Theseus synthetic analysis (2026-04-22)

The dual-use finding extends to multi-layer ensemble probes, with an important scope qualification: open-weights models remain fully vulnerable to white-box multi-layer SCAV attacks (the attacker computes concept directions at all monitored layers and constructs a perturbation that suppresses them simultaneously). Closed-source models may gain genuine black-box protection if rotation patterns are model-specific, but this remains empirically untested. The monitoring precision hierarchy holds across all levels, but attack cost varies by deployment context.