theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis

- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
Teleo Agents 2026-04-22 07:28:18 +00:00
parent c8aacf2e5b
commit 280a081d3d
2 changed files with 15 additions and 1 deletions


@@ -44,3 +44,10 @@ The 29-78% AUROC improvement is a clean-data accuracy result that does not trans
**Source:** Theseus synthetic analysis of white-box SCAV generalization
The 29-78% accuracy improvement applies to clean-data monitoring but does not translate to adversarial robustness in open-weights deployments. White-box attackers can generalize SCAV to multi-layer ensembles by computing concept directions at each monitored layer and constructing perturbations that suppress all simultaneously. The improvement is real but scope-limited to non-adversarial or black-box adversarial contexts.
## Extending Evidence
**Source:** Theseus synthetic analysis
The 29-78% AUROC improvement applies to clean-data monitoring accuracy but does not translate to adversarial robustness. Open-weights models remain fully vulnerable to white-box multi-layer SCAV attacks regardless of ensemble complexity. Black-box robustness depends on untested rotation pattern universality.


@@ -12,7 +12,7 @@ sourcer: Theseus
related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]"]
supports: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features"]
reweave_edges: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features|supports|2026-04-12", "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17"]
-related: ["Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios", "trajectory-monitoring-dual-edge-geometric-concentration"]
+related: ["Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces"]
---
# Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters
@@ -45,3 +45,10 @@ Multi-layer ensemble probes (Nordby et al. 2026) improve clean monitoring accura
**Source:** Theseus synthetic analysis of Nordby et al. + SCAV literature
Multi-layer ensemble probes, despite 29-78% accuracy improvements over single-layer probes, remain structurally vulnerable to white-box SCAV attacks through multi-layer concept direction suppression. The dual-use finding extends to all monitoring precision levels, with deployment context (open-weights vs. closed-source, white-box vs. black-box) determining severity rather than architectural sophistication eliminating the problem.
## Extending Evidence
**Source:** Theseus synthetic analysis of Nordby et al. + Xu et al. SCAV
White-box multi-layer SCAV is structurally feasible by computing concept directions at each monitored layer and constructing a single perturbation that suppresses all simultaneously. This extends the dual-use finding to multi-layer ensembles in the white-box case, confirming that architectural complexity raises attack cost but does not provide structural escape.
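
The structural argument above can be illustrated with a toy numerical sketch. This is not the method of Xu et al. or Nordby et al.: it assumes synthetic Gaussian activations, uses a difference-of-means direction as a stand-in for a trained concept activation vector, and treats each monitored layer as independent (in a real network, a perturbation at one layer propagates to later layers). All names (`concept_direction`, `probe_scores`, `ensemble_flags`, `margin`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def concept_direction(pos, neg):
    """Unit difference-of-means direction (a simplified stand-in for a CAV)."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)

def probe_scores(acts, direction, midpoint):
    """Linear probe score: signed projection of activations past the class midpoint."""
    return (acts - midpoint) @ direction

rng = np.random.default_rng(0)
n_layers, dim, n = 3, 16, 200  # assumed toy sizes

directions, midpoints, flagged_acts = [], [], []
for _ in range(n_layers):
    # Synthetic "concept present" vs "concept absent" activations for one layer.
    true_dir = rng.normal(size=dim)
    true_dir /= np.linalg.norm(true_dir)
    neg = rng.normal(size=(n, dim))
    pos = rng.normal(size=(n, dim)) + 4.0 * true_dir
    directions.append(concept_direction(pos, neg))
    midpoints.append((pos.mean(axis=0) + neg.mean(axis=0)) / 2)
    flagged_acts.append(pos)

def ensemble_flags(acts_per_layer):
    """Multi-layer ensemble: flag a sample when the mean probe score is positive."""
    scores = np.stack([
        probe_scores(a, d, m)
        for a, d, m in zip(acts_per_layer, directions, midpoints)
    ])
    return scores.mean(axis=0) > 0

before = ensemble_flags(flagged_acts).mean()

# White-box step: with every layer's direction known, shift each layer's
# activations along its own direction until the probe score equals -margin.
margin = 1.0
perturbed = []
for a, d, m in zip(flagged_acts, directions, midpoints):
    s = probe_scores(a, d, m)
    perturbed.append(a - np.outer(s + margin, d))

after = ensemble_flags(perturbed).mean()
print(f"flag rate before: {before:.2f}, after: {after:.2f}")
```

In this toy setting the per-layer shifts together form the single stacked perturbation, and the ensemble adds cost (one direction per layer) without changing the outcome. Whether the same holds in a real model, where layers are coupled, is exactly the part the synthesis leaves to deployment context.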