theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis

- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
Teleo Agents 2026-04-22 02:43:50 +00:00
parent 5329fe5f3e
commit 38b3437ea0
3 changed files with 21 additions and 0 deletions


@ -30,3 +30,10 @@ The 29-78% AUROC improvement occurs because deception-relevant concept direction
**Source:** Theseus synthetic analysis (2026-04-22)
Nordby et al.'s clean-data accuracy improvements do not translate to adversarial robustness. White-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously through higher-dimensional optimization. Open-weights models gain no structural protection from multi-layer ensembles against adversarially sophisticated attackers with model access.
## Extending Evidence
**Source:** Theseus synthetic analysis
The 29-78% AUROC improvement is a clean-data accuracy result. Nordby et al. did not test adversarial robustness. Under white-box adversarial conditions (open-weights models), multi-layer ensembles provide no structural protection against SCAV-generalization attacks. The improvement applies only to non-adversarial monitoring contexts.


@ -24,3 +24,10 @@ The method characterizes alignment-relevant properties as geometric patterns in
**Source:** Theseus synthetic analysis (2026-04-22)
Trajectory geometry monitoring does create adversarial attack surfaces through multi-layer SCAV generalization. The claim that geometric approaches avoid attack surface creation is incorrect for white-box adversaries with access to model weights. The claim may need qualification to 'black-box adversaries only, contingent on rotation pattern specificity.'
## Challenging Evidence
**Source:** Theseus synthetic analysis of multi-layer SCAV feasibility
Multi-layer trajectory geometry monitoring (ensemble probes across layers) does create adversarial attack surface in white-box settings. White-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously. The claim that trajectory geometry avoids attack surfaces may need qualification: it may hold only for black-box adversaries if rotation patterns are model-specific, but not for white-box adversaries with access to model weights.


@ -31,3 +31,10 @@ The dual-use finding now extends to multi-layer ensemble monitoring with deploym
**Source:** Theseus synthetic analysis (2026-04-22)
The dual-use vulnerability extends to multi-layer ensemble monitoring, not just single-layer probes. However, the severity is deployment-context-dependent: open-weights models (white-box adversaries) remain fully vulnerable, while closed-source models (black-box adversaries) may gain protection if rotation patterns are model-specific (untested assumption).
## Extending Evidence
**Source:** Theseus synthetic analysis of Nordby et al. + SCAV
Multi-layer ensemble probes do not escape the dual-use structure in white-box settings (open-weights models). White-box multi-layer SCAV is feasible by computing concept directions at each monitored layer and constructing a single perturbation that suppresses all simultaneously. The monitoring precision hierarchy holds: each level is structurally defeatable given sufficient attacker capability. Multi-layer ensembles raise attack cost but do not escape the dual-use structure for open-weights deployments.
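
The "single perturbation that suppresses all simultaneously" argument can be illustrated with a toy linear sketch. This is not the SCAV algorithm or Nordby et al.'s setup: it assumes each monitored layer responds linearly to an input perturbation (a hypothetical Jacobian `J_l`), that each probe is a single unit-norm direction `w_l`, and that all data are random. The function name `joint_suppression` and all parameters are illustrative only.

```python
import numpy as np

def joint_suppression(jacobians, directions, activations):
    """Minimum-norm perturbation d that zeroes every probe projection,
    assuming a linearised model h_l(x + d) ~= h_l + J_l @ d (toy assumption)."""
    # One linear constraint per monitored layer: (w_l^T J_l) d = -(w_l^T h_l)
    A = np.stack([w @ J for w, J in zip(directions, jacobians)])
    b = -np.array([w @ h for w, h in zip(directions, activations)])
    # For an underdetermined system, lstsq returns the minimum-norm solution
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

rng = np.random.default_rng(0)
n_in, n_hidden, n_layers = 32, 16, 4  # hypothetical sizes

# Random stand-ins for per-layer Jacobians, probe directions, and activations
jacobians = [rng.normal(size=(n_hidden, n_in)) for _ in range(n_layers)]
directions = [rng.normal(size=n_hidden) for _ in range(n_layers)]
directions = [w / np.linalg.norm(w) for w in directions]
activations = [rng.normal(size=n_hidden) for _ in range(n_layers)]

# Attack all monitored layers at once
delta = joint_suppression(jacobians, directions, activations)
after = np.array([w @ (h + J @ delta)
                  for w, J, h in zip(directions, jacobians, activations)])

# Attack only the first layer, for a cost comparison
single = joint_suppression(jacobians[:1], directions[:1], activations[:1])

print("probe projections after joint attack:", after)
print("perturbation norm, 1 layer vs all layers:",
      np.linalg.norm(single), np.linalg.norm(delta))
```

In this toy setting the joint perturbation drives every probe projection to approximately zero, and its norm is never smaller than the single-layer one, because it must satisfy a superset of the constraints. That matches the qualitative claim above (higher attack cost, no structural protection) but says nothing about real nonlinear models, where the linearisation assumption may fail.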