Compare commits

1 commit

Author SHA1 Message Date
Teleo Agents
e60b24270a theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis
- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-22 08:08:20 +00:00
2 changed files with 14 additions and 0 deletions

@@ -31,3 +31,10 @@ Trajectory geometry monitoring does create adversarial attack surfaces through m
**Source:** Theseus synthetic analysis of SCAV generalization to multi-layer ensembles
Multi-layer ensemble analysis shows trajectory geometry monitoring DOES create attack surfaces in white-box settings. While multi-layer ensembles are harder to exploit than single-layer probes, white-box multi-layer SCAV is structurally feasible through simultaneous suppression of concept directions at all monitored layers. The claim that trajectory geometry avoids attack surfaces may need qualification to 'reduces attack surface in black-box settings if rotation patterns are model-specific.'
## Challenging Evidence
**Source:** Theseus synthetic analysis of multi-layer SCAV generalization
Multi-layer ensemble analysis shows that trajectory geometry monitoring does create adversarial attack surfaces, although they are harder to exploit than single-layer ones. White-box multi-layer SCAV is structurally feasible: an attacker computes the concept direction at each monitored layer and constructs perturbations that suppress all of them simultaneously. The claim that trajectory geometry avoids attack surfaces therefore needs qualification: it may avoid single-layer attack surfaces but not multi-layer ones.
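
As a rough illustration of the construction described above, the sketch below uses a purely linear toy model. That linearity is an assumption of this example and is not something the source analysis specifies. Each hypothetical matrix in `layers` maps the input directly to one monitored layer's activations, each vector in `probes` stands in for that layer's concept direction, and `suppress_all` removes the input's component along every pulled-back direction at once. Real networks are nonlinear, so this only shows why the simultaneous constraint is solvable in principle, not how an actual SCAV attack is implemented.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec_t(A, w):
    """Return A^T w for a matrix A stored as a list of rows."""
    return [sum(A[r][c] * w[r] for r in range(len(A))) for c in range(len(A[0]))]


def effective_directions(layers, probes):
    """Pull each layer's probe direction back to input space (valid for the linear toy only)."""
    return [matvec_t(A, w) for A, w in zip(layers, probes)]


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt orthonormalisation; numerically dependent vectors are skipped."""
    basis = []
    for v in vectors:
        r = list(v)
        for q in basis:
            c = dot(q, r)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:
            basis.append([ri / norm for ri in r])
    return basis


def suppress_all(x, directions):
    """Smallest change to x (in Euclidean norm) that zeroes every probe score at once."""
    out = list(x)
    for q in orthonormal_basis(directions):
        c = dot(q, out)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


# Hypothetical toy setup: 3 monitored layers, all sizes chosen arbitrarily.
random.seed(0)
d, hidden, n_layers = 16, 8, 3
layers = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(hidden)]
          for _ in range(n_layers)]
probes = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(n_layers)]
x = [random.gauss(0, 1) for _ in range(d)]

dirs = effective_directions(layers, probes)
x_adv = suppress_all(x, dirs)
scores_before = [dot(g, x) for g in dirs]
scores_after = [dot(g, x_adv) for g in dirs]
```

In this toy, a single perturbation drives every probe score to numerical zero, which mirrors the structural-feasibility point. It says nothing about attack cost or success rates on real models.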

@@ -52,3 +52,10 @@ Multi-layer ensemble probes, despite 29-78% accuracy improvements over single-la
**Source:** Theseus synthetic analysis of Nordby et al. + Xu et al. SCAV
White-box multi-layer SCAV is structurally feasible by computing concept directions at each monitored layer and constructing a single perturbation that suppresses all simultaneously. This extends the dual-use finding to multi-layer ensembles in the white-box case, confirming that architectural complexity raises attack cost but does not provide structural escape.
## Extending Evidence
**Source:** Theseus synthetic analysis of Nordby et al. + SCAV literature
Multi-layer ensemble probes, despite improving clean monitoring accuracy by 29-78% over single-layer probes, remain structurally vulnerable to white-box SCAV attacks in open-weights models. The dual-use finding extends to all monitoring precision levels: each architectural improvement raises attack cost but does not escape the fundamental dual-use structure. Open-weights models face full white-box vulnerability regardless of ensemble complexity.
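
The "raises attack cost but does not escape" pattern can be seen in a small linear toy, offered here as a hedged sketch and not as a result from the cited work. It assumes each monitored layer contributes one hypothetical input-space direction (drawn at random below) and measures "cost" as the Euclidean norm of the smallest input change that zeroes the component along every monitored direction. The function `removal_norm` and all sizes are illustrative choices.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def removal_norm(x, directions, tol=1e-12):
    """Norm of the smallest change to x that zeroes its component along every direction."""
    basis, removed_sq = [], 0.0
    for v in directions:
        r = list(v)
        # Orthogonalise against the directions already handled (Gram-Schmidt).
        for q in basis:
            c = dot(q, r)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:
            q = [ri / norm for ri in r]
            basis.append(q)
            # Each new orthonormal direction adds a non-negative term to the cost.
            removed_sq += dot(q, x) ** 2
    return math.sqrt(removed_sq)


# Hypothetical setup: a 32-dimensional input and up to 6 monitored layers.
random.seed(1)
d = 32
x = [random.gauss(0, 1) for _ in range(d)]
directions = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

# Cost of suppressing the first k monitored directions, for k = 1..6.
costs = [removal_norm(x, directions[:k]) for k in range(1, 7)]
input_norm = math.sqrt(dot(x, x))
```

Under these assumptions `costs` never decreases as layers are added, yet it stays below `input_norm` while the monitored directions span only part of the input space: more monitored layers make the perturbation larger without making it impossible. Whether the same holds for nonlinear models is not established by this toy.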