diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
index caa5bbd89..dbb971d34 100644
--- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
@@ -37,3 +37,10 @@ Nordby et al.'s clean-data accuracy improvements do not translate to adversarial
 **Source:** Theseus adversarial robustness analysis
 
 The 29-78% AUROC improvement is a clean-data accuracy result that does not translate to adversarial robustness. Nordby et al. contains no adversarial evaluation. White-box multi-layer SCAV is structurally feasible against these ensembles in open-weights models. The accuracy improvement is real but scoped to non-adversarial monitoring contexts.
+
+
+## Extending Evidence
+
+**Source:** Theseus synthetic analysis of SCAV generalization to multi-layer ensembles
+
+The 29-78% AUROC improvement applies to clean-data monitoring accuracy but does not translate to adversarial robustness. White-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously. Black-box robustness depends on untested rotation pattern universality: if rotation patterns are model-specific, the accuracy improvement may provide genuine adversarial protection for closed-source models; if universal, it does not.
diff --git a/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md b/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md
index 9fc0dc718..23918bbf9 100644
--- a/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md
+++ b/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md
@@ -24,3 +24,10 @@ The method characterizes alignment-relevant properties as geometric patterns in
 **Source:** Theseus synthetic analysis (2026-04-22)
 
 Trajectory geometry monitoring does create adversarial attack surfaces through multi-layer SCAV generalization. The claim that geometric approaches avoid attack surface creation is incorrect for white-box adversaries with access to model weights. May need qualification to 'black-box adversaries only, contingent on rotation pattern specificity.'
+
+
+## Challenging Evidence
+
+**Source:** Theseus synthetic analysis of multi-layer SCAV generalization
+
+Multi-layer ensemble analysis shows trajectory geometry monitoring DOES create attack surface, just harder to exploit. White-box adversaries can generalize SCAV to multi-layer ensembles by computing concept directions at each monitored layer. The claim that trajectory geometry avoids attack surfaces needs qualification: it may avoid single-layer attack surfaces but remains vulnerable to multi-layer white-box attacks. Only if rotation patterns are model-specific (untested) does black-box deployment provide genuine protection.
diff --git a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
index bc3c8682a..9bd0f6c7d 100644
--- a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
+++ b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
@@ -38,3 +38,10 @@ The dual-use vulnerability extends to multi-layer ensemble monitoring, not just
 **Source:** Theseus synthetic analysis of Nordby et al. (arXiv 2604.13386, April 2026)
 
 Multi-layer ensemble probes (Nordby et al. 2026) improve clean monitoring accuracy 29-78% but provide no structural protection against white-box adversaries in open-weights models. White-box multi-layer SCAV can compute concept directions at each monitored layer and construct a single perturbation suppressing all simultaneously. The dual-use finding extends to all monitoring precision levels with scope qualification: open-weights models face structural vulnerability regardless of ensemble complexity; closed-source models may gain genuine black-box protection if rotation patterns are model-specific (untested).
+
+
+## Extending Evidence
+
+**Source:** Theseus synthetic analysis of Nordby et al. + SCAV literature
+
+Multi-layer ensemble probes, despite 29-78% accuracy improvements over single-layer probes, do not escape the dual-use structure in white-box settings. White-box multi-layer SCAV can compute concept directions at each monitored layer and construct perturbations that suppress all simultaneously. The dual-use finding extends to all monitoring precision levels with deployment-context qualification: open-weights models remain fully vulnerable while closed-source models may gain protection if rotation patterns are model-specific (untested).