theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis

- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
Teleo Agents 2026-04-22 05:28:47 +00:00
parent 6079834e86
commit a832cb99c0
3 changed files with 21 additions and 0 deletions

@ -37,3 +37,10 @@ Nordby et al.'s clean-data accuracy improvements do not translate to adversarial
**Source:** Theseus adversarial robustness analysis
The 29-78% AUROC improvement is a clean-data accuracy result that does not translate to adversarial robustness. Nordby et al. include no adversarial evaluation. White-box multi-layer SCAV is structurally feasible against these ensembles in open-weights models. The accuracy improvement is real but scoped to non-adversarial monitoring contexts.
## Extending Evidence
**Source:** Theseus synthetic analysis of white-box SCAV generalization
The 29-78% accuracy improvement applies to clean-data monitoring but does not translate to adversarial robustness in open-weights deployments. White-box attackers can generalize SCAV to multi-layer ensembles by computing concept directions at each monitored layer and constructing perturbations that suppress all simultaneously. The improvement is real but scope-limited to non-adversarial or black-box adversarial contexts.

@ -24,3 +24,10 @@ The method characterizes alignment-relevant properties as geometric patterns in
**Source:** Theseus synthetic analysis (2026-04-22)
Trajectory geometry monitoring does create adversarial attack surfaces through multi-layer SCAV generalization. The claim that geometric approaches avoid attack surface creation is incorrect for white-box adversaries with access to model weights. The claim may need qualification to 'black-box adversaries only, contingent on rotation pattern specificity.'
## Challenging Evidence
**Source:** Theseus synthetic analysis of SCAV generalization to multi-layer ensembles
Multi-layer ensemble analysis shows trajectory geometry monitoring DOES create attack surfaces in white-box settings. While multi-layer ensembles are harder to exploit than single-layer probes, white-box multi-layer SCAV is structurally feasible through simultaneous suppression of concept directions at all monitored layers. The claim that trajectory geometry avoids attack surfaces may need qualification to 'reduces attack surface in black-box settings if rotation patterns are model-specific.'

@ -38,3 +38,10 @@ The dual-use vulnerability extends to multi-layer ensemble monitoring, not just
**Source:** Theseus synthetic analysis of Nordby et al. (arXiv 2604.13386, April 2026)
Multi-layer ensemble probes (Nordby et al. 2026) improve clean monitoring accuracy 29-78% but provide no structural protection against white-box adversaries in open-weights models. White-box multi-layer SCAV can compute concept directions at each monitored layer and construct a single perturbation suppressing all simultaneously. The dual-use finding extends to all monitoring precision levels with scope qualification: open-weights models face structural vulnerability regardless of ensemble complexity; closed-source models may gain genuine black-box protection if rotation patterns are model-specific (untested).
## Extending Evidence
**Source:** Theseus synthetic analysis of Nordby et al. + SCAV literature
Multi-layer ensemble probes, despite 29-78% accuracy improvements over single-layer probes, remain structurally vulnerable to white-box SCAV attacks through multi-layer concept direction suppression. The dual-use finding extends to all monitoring precision levels, with deployment context (open-weights vs. closed-source, white-box vs. black-box) determining severity rather than architectural sophistication eliminating the problem.
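
The structural argument above (concept directions at each monitored layer, one perturbation suppressing all of them) can be illustrated with a toy linear-algebra sketch. This is not the SCAV method itself, nor Nordby et al.'s ensemble: it assumes a fully linearized model in which each monitored layer is a fixed matrix applied to the input, and each probe is a single linear direction. The names `joint_suppression`, `jacobians`, and `directions` are hypothetical, and all values are random. Under those assumptions, zeroing k probe scores is a system of k linear constraints on a d-dimensional input, which is generically solvable whenever k is much smaller than d.

```python
import numpy as np

def joint_suppression(jacobians, directions, scores):
    """Minimum-norm input perturbation that zeroes every probe score (toy, linearized)."""
    # Row l is the gradient of probe l's score with respect to the input.
    A = np.stack([J.T @ w for J, w in zip(jacobians, directions)])
    # Solve A @ delta = -scores; lstsq returns the minimum-norm solution
    # when the system is underdetermined (fewer probes than input dimensions).
    delta, *_ = np.linalg.lstsq(A, -np.asarray(scores), rcond=None)
    return delta

# Assumed toy setup: 3 monitored layers, each a random linear map of the input.
rng = np.random.default_rng(0)
d_in, d_hidden, n_layers = 16, 8, 3
jacobians = [rng.normal(size=(d_hidden, d_in)) for _ in range(n_layers)]
directions = [rng.normal(size=d_hidden) for _ in range(n_layers)]

x = rng.normal(size=d_in)
scores = [w @ (J @ x) for J, w in zip(jacobians, directions)]

delta = joint_suppression(jacobians, directions, scores)
residual = [w @ (J @ (x + delta)) for J, w in zip(jacobians, directions)]
print("probe scores before:", np.round(scores, 3))
print("probe scores after: ", np.round(residual, 3))
```

In this toy setting, adding monitored layers only adds rows to the constraint matrix, which is one way to read the claim that ensemble complexity alone offers no structural protection when the attacker can compute the per-layer gradients. Real models are nonlinear, so this sketch says nothing about how hard the attack is in practice.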