theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis
Some checks failed
Mirror PR to Forgejo / mirror (pull_request) Has been cancelled
- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
parent 7fdd8c9718
commit b01875879e
3 changed files with 21 additions and 0 deletions
@@ -37,3 +37,10 @@ Nordby et al.'s clean-data accuracy improvements do not translate to adversarial
**Source:** Theseus adversarial robustness analysis
The 29-78% AUROC improvement is a clean-data accuracy result that does not translate to adversarial robustness. Nordby et al. contains no adversarial evaluation. White-box multi-layer SCAV is structurally feasible against these ensembles in open-weights models. The accuracy improvement is real but scoped to non-adversarial monitoring contexts.
## Challenging Evidence
**Source:** Theseus synthetic analysis (2026-04-22)
While multi-layer ensembles improve clean-data accuracy substantially, this synthetic analysis shows they provide no structural protection against white-box adversarial attacks (open-weights models) and uncertain protection in black-box settings (depends on untested rotation pattern universality). The accuracy improvement does not translate to adversarial robustness.
@@ -24,3 +24,10 @@ The method characterizes alignment-relevant properties as geometric patterns in
**Source:** Theseus synthetic analysis (2026-04-22)
Trajectory geometry monitoring does create adversarial attack surfaces through multi-layer SCAV generalization. The claim that geometric approaches avoid attack surface creation is incorrect for white-box adversaries with access to model weights. May need qualification to 'black-box adversaries only, contingent on rotation pattern specificity.'
## Challenging Evidence
**Source:** Theseus synthetic analysis (2026-04-22)
This claim may need revision: trajectory geometry monitoring DOES create attack surface in white-box settings (multi-layer SCAV is feasible), though it may be harder to exploit in black-box settings if rotation patterns are model-specific. The 'without creating adversarial attack surfaces' clause is too strong for open-weights deployment contexts.
@@ -38,3 +38,10 @@ The dual-use vulnerability extends to multi-layer ensemble monitoring, not just
**Source:** Theseus synthetic analysis of Nordby et al. (arXiv 2604.13386, April 2026)
Multi-layer ensemble probes (Nordby et al. 2026) improve clean monitoring accuracy 29-78% but provide no structural protection against white-box adversaries in open-weights models. White-box multi-layer SCAV can compute concept directions at each monitored layer and construct a single perturbation suppressing all simultaneously. The dual-use finding extends to all monitoring precision levels with scope qualification: open-weights models face structural vulnerability regardless of ensemble complexity; closed-source models may gain genuine black-box protection if rotation patterns are model-specific (untested).
## Extending Evidence
**Source:** Theseus synthetic analysis (2026-04-22)
The dual-use finding now extends to multi-layer ensemble monitoring: white-box multi-layer SCAV is structurally feasible by computing concept directions at each monitored layer and constructing a single perturbation that suppresses all simultaneously. The monitoring precision hierarchy holds at all architectural levels for open-weights models.
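
The geometric argument above can be illustrated with a toy sketch. This is not the SCAV procedure itself and not anything taken from Nordby et al.; it assumes, purely for illustration, that each monitored layer is a linear map of the input and that each monitor is a linear probe whose weights the attacker can read (the white-box, open-weights case). All names (`layer_maps`, `probe_dirs`, `joint_suppression`) are hypothetical. Under those assumptions, each probe direction can be pulled back to input space, and one perturbation that removes the input's component in the span of those directions drives every probe score to zero at once.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def matvec_T(W, w):
    """Return W^T w for a row-major matrix W (list of rows)."""
    return [sum(W[i][j] * w[i] for i in range(len(W))) for j in range(len(W[0]))]

def joint_suppression(layer_maps, probe_dirs, x):
    """Toy model: layer l's activation is layer_maps[l] @ x and its probe
    score is probe_dirs[l] . activation. Returns one perturbation of x
    that zeroes every probe score, plus the pulled-back directions."""
    # Pull each probe back to input space so that score_l(x) = g_l . x.
    pulled = [matvec_T(W, w) for W, w in zip(layer_maps, probe_dirs)]
    # Modified Gram-Schmidt: orthonormal basis for span(g_1, ..., g_L).
    basis = []
    for g in pulled:
        for b in basis:
            c = dot(g, b)
            g = [gi - c * bi for gi, bi in zip(g, b)]
        norm = dot(g, g) ** 0.5
        if norm > 1e-12:  # skip directions already covered by the basis
            basis.append([gi / norm for gi in g])
    # The perturbation subtracts x's component inside that span.
    delta = [0.0] * len(x)
    for b in basis:
        c = dot(x, b)
        delta = [d - c * bi for d, bi in zip(delta, b)]
    return delta, pulled

# Hypothetical random "model": 3 monitored layers over a 16-dim input.
random.seed(0)
dim, n_layers = 16, 3
layer_maps = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
              for _ in range(n_layers)]
probe_dirs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_layers)]
x = [random.gauss(0, 1) for _ in range(dim)]

delta, pulled = joint_suppression(layer_maps, probe_dirs, x)
x_adv = [xi + di for xi, di in zip(x, delta)]
before = [dot(g, x) for g in pulled]      # probe scores on the clean input
after = [dot(g, x_adv) for g in pulled]   # probe scores after the perturbation
```

In this linear toy, adding more monitored layers only adds more directions to project out, which is the sense in which ensemble size offers no structural protection when weights are available. Real networks are nonlinear, so an actual attack would presumably need an iterative or gradient-based variant; that step is outside what this sketch shows.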