theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis
Some checks are pending
Mirror PR to Forgejo / mirror (pull_request) Waiting to run
Some checks are pending
Mirror PR to Forgejo / mirror (pull_request) Waiting to run
- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md - Domain: ai-alignment - Claims: 0, Entities: 0 - Enrichments: 2 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus <PIPELINE>
This commit is contained in:
parent
c8aacf2e5b
commit
280a081d3d
2 changed files with 15 additions and 1 deletions
|
|
@ -44,3 +44,10 @@ The 29-78% AUROC improvement is a clean-data accuracy result that does not trans
|
||||||
**Source:** Theseus synthetic analysis of white-box SCAV generalization
|
**Source:** Theseus synthetic analysis of white-box SCAV generalization
|
||||||
|
|
||||||
The 29-78% accuracy improvement applies to clean-data monitoring but does not translate to adversarial robustness in open-weights deployments. White-box attackers can generalize SCAV to multi-layer ensembles by computing concept directions at each monitored layer and constructing perturbations that suppress all simultaneously. The improvement is real but scope-limited to non-adversarial or black-box adversarial contexts.
|
The 29-78% accuracy improvement applies to clean-data monitoring but does not translate to adversarial robustness in open-weights deployments. White-box attackers can generalize SCAV to multi-layer ensembles by computing concept directions at each monitored layer and constructing perturbations that suppress all simultaneously. The improvement is real but scope-limited to non-adversarial or black-box adversarial contexts.
|
||||||
|
|
||||||
|
|
||||||
|
## Extending Evidence
|
||||||
|
|
||||||
|
**Source:** Theseus synthetic analysis
|
||||||
|
|
||||||
|
The 29-78% AUROC improvement applies to clean-data monitoring accuracy but does not translate to adversarial robustness. Open-weights models remain fully vulnerable to white-box multi-layer SCAV attacks regardless of ensemble complexity. Black-box robustness depends on untested rotation pattern universality.
|
||||||
|
|
|
||||||
|
|
@ -12,7 +12,7 @@ sourcer: Theseus
|
||||||
related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]"]
|
related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]"]
|
||||||
supports: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features"]
|
supports: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features"]
|
||||||
reweave_edges: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features|supports|2026-04-12", "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17"]
|
reweave_edges: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features|supports|2026-04-12", "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17"]
|
||||||
related: ["Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios", "trajectory-monitoring-dual-edge-geometric-concentration"]
|
related: ["Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces"]
|
||||||
---
|
---
|
||||||
|
|
||||||
# Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters
|
# Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters
|
||||||
|
|
@ -45,3 +45,10 @@ Multi-layer ensemble probes (Nordby et al. 2026) improve clean monitoring accura
|
||||||
**Source:** Theseus synthetic analysis of Nordby et al. + SCAV literature
|
**Source:** Theseus synthetic analysis of Nordby et al. + SCAV literature
|
||||||
|
|
||||||
Multi-layer ensemble probes, despite 29-78% accuracy improvements over single-layer probes, remain structurally vulnerable to white-box SCAV attacks through multi-layer concept direction suppression. The dual-use finding extends to all monitoring precision levels, with deployment context (open-weights vs. closed-source, white-box vs. black-box) determining severity rather than architectural sophistication eliminating the problem.
|
Multi-layer ensemble probes, despite 29-78% accuracy improvements over single-layer probes, remain structurally vulnerable to white-box SCAV attacks through multi-layer concept direction suppression. The dual-use finding extends to all monitoring precision levels, with deployment context (open-weights vs. closed-source, white-box vs. black-box) determining severity rather than architectural sophistication eliminating the problem.
|
||||||
|
|
||||||
|
|
||||||
|
## Extending Evidence
|
||||||
|
|
||||||
|
**Source:** Theseus synthetic analysis of Nordby et al. + Xu et al. SCAV
|
||||||
|
|
||||||
|
White-box multi-layer SCAV is structurally feasible by computing concept directions at each monitored layer and constructing a single perturbation that suppresses all simultaneously. This extends the dual-use finding to multi-layer ensembles in the white-box case, confirming that architectural complexity raises attack cost but does not provide structural escape.
|
||||||
|
|
|
||||||
Loading…
Reference in a new issue