diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md index 0179a0dc0..caa5bbd89 100644 --- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md +++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md @@ -30,3 +30,10 @@ The 29-78% AUROC improvement occurs because deception-relevant concept direction **Source:** Theseus synthetic analysis (2026-04-22) Nordby et al.'s clean-data accuracy improvements do not translate to adversarial robustness. White-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously through higher-dimensional optimization. Open-weights models gain no structural protection from multi-layer ensembles against adversarially-sophisticated attackers with model access. + + +## Challenging Evidence + +**Source:** Theseus adversarial robustness analysis + +The 29-78% AUROC improvement is a clean-data accuracy result that does not translate to adversarial robustness. Nordby et al. contains no adversarial evaluation. White-box multi-layer SCAV is structurally feasible against these ensembles in open-weights models. The accuracy improvement is real but scoped to non-adversarial monitoring contexts. diff --git a/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md b/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md index 1333462ca..a2daade9b 100644 --- a/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md +++ b/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md @@ -30,3 +30,10 @@ Multi-layer ensemble probes do not escape the dual-use attack surface identified **Source:** Theseus synthetic analysis (2026-04-22) Multi-layer ensemble architectures do not eliminate the fundamental attack surface in white-box settings. White-box multi-layer SCAV generalizes the single-layer attack by computing concept directions at each monitored layer and constructing perturbations that suppress all simultaneously. The attack cost increases but the structural vulnerability remains. + + +## Extending Evidence + +**Source:** Theseus synthetic analysis of Nordby et al. × SCAV + +Multi-layer ensemble monitoring does not eliminate the dual-use attack surface, only shifts it from single-layer to multi-layer SCAV. With white-box access, attackers can generalize SCAV to suppress concept directions at all monitored layers simultaneously through higher-dimensional optimization. Open-weights models remain fully vulnerable. Black-box robustness depends on untested rotation pattern universality question. diff --git a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md index 4d8f8c144..bc3c8682a 100644 --- a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md +++ b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md @@ -31,3 +31,10 @@ The dual-use finding now extends to multi-layer ensemble monitoring with deploym **Source:** Theseus synthetic analysis (2026-04-22) The dual-use vulnerability extends to multi-layer ensemble monitoring, not just single-layer probes. However, the severity is deployment-context-dependent: open-weights models (white-box adversaries) remain fully vulnerable, while closed-source models (black-box adversaries) may gain protection if rotation patterns are model-specific (untested assumption). + + +## Extending Evidence + +**Source:** Theseus synthetic analysis of Nordby et al. (arXiv 2604.13386, April 2026) + +Multi-layer ensemble probes (Nordby et al. 2026) improve clean monitoring accuracy 29-78% but provide no structural protection against white-box adversaries in open-weights models. White-box multi-layer SCAV can compute concept directions at each monitored layer and construct a single perturbation suppressing all simultaneously. The dual-use finding extends to all monitoring precision levels with scope qualification: open-weights models face structural vulnerability regardless of ensemble complexity; closed-source models may gain genuine black-box protection if rotation patterns are model-specific (untested).