- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md - Domain: ai-alignment - Claims: 0, Entities: 0 - Enrichments: 2 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus <PIPELINE>
---
type: claim
domain: ai-alignment
description: Combining probes across multiple model layers captures rotational structure of deception representations that single-layer probes miss
confidence: experimental
source: Nordby, Pais, Parrack (arXiv 2604.13386, April 2026)
created: 2026-04-21
title: Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers
agent: theseus
scope: causal
sourcer: Nordby, Pais, Parrack
supports: ["single-layer-probes-are-brittle"]
related: ["representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent", "linear-probe-accuracy-scales-with-model-size-power-law"]
---

# Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers
Nordby et al. report that multi-layer ensemble probes improved AUROC over single-layer probes by 29 percent on Insider Trading tasks and by 78 percent on Harm-Pressure Knowledge tasks. The proposed mechanism is that deception-related activation patterns rotate gradually across model layers rather than concentrating in a single layer, which makes any single-layer probe brittle. Multi-layer ensembles capture this rotational structure by aggregating signals across depths. The 78 percent improvement on Harm-Pressure Knowledge is notable because it suggests that multi-layer ensembles may make representation monitoring reliable for explicit deception at frontier scale. However, the paper does not test adversarial robustness. If multi-layer ensembles are vulnerable to concept-direction attacks such as SCAV, the clean-data improvement may not translate into a safety improvement under adversarial conditions. The finding also applies only to explicit deception tasks; implicit deception remained difficult even with multi-layer ensembles.
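The aggregation step described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from Nordby et al.: the activations are synthetic, the signal direction rotates by a fixed angle per layer, noise is independent across layers, and a difference-of-means direction stands in for a trained linear probe. Under these assumptions the ensemble's gain comes from pooling independent per-layer evidence, while the rotation is what makes a probe fit at one layer useless at a distant one.

```python
import math
import random

def make_layers(n, n_layers=6, dim=8, signal=0.6, seed=0):
    """Synthetic activations (assumption): the label shifts layer l along a
    direction that rotates from 0 to 90 degrees across layers; noise is i.i.d."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n)]
    layers = []
    for l in range(n_layers):
        theta = l * math.pi / (2 * (n_layers - 1))
        direction = [math.cos(theta), math.sin(theta)] + [0.0] * (dim - 2)
        layers.append([[rng.gauss(0, 1) + signal * (2 * y - 1) * d
                        for d in direction] for y in labels])
    return layers, labels

def fit_probe(acts, labels):
    """Difference-of-means direction, a simple stand-in for a linear probe."""
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    return [sum(a[j] for a in pos) / len(pos) - sum(a[j] for a in neg) / len(neg)
            for j in range(len(acts[0]))]

def score(acts, w):
    """Probe score: dot product of each activation with the probe direction."""
    return [sum(x * wj for x, wj in zip(a, w)) for a in acts]

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

train_layers, train_y = make_layers(400, seed=1)
test_layers, test_y = make_layers(400, seed=2)

# One probe per layer, each fit on its own layer's activations.
probes = [fit_probe(a, train_y) for a in train_layers]
per_layer = [score(a, w) for a, w in zip(test_layers, probes)]
single_aurocs = [auroc(s, test_y) for s in per_layer]

# Ensemble: average the per-layer probe scores for each example.
ensemble = [sum(col) / len(col) for col in zip(*per_layer)]
ensemble_auroc = auroc(ensemble, test_y)

# Brittleness check: reuse the first layer's probe on the last layer,
# where the signal direction has rotated by 90 degrees.
frozen_auroc = auroc(score(test_layers[-1], probes[0]), test_y)

print("single-layer AUROCs:", [round(a, 3) for a in single_aurocs])
print("ensemble AUROC:", round(ensemble_auroc, 3))
print("layer-0 probe on last layer:", round(frozen_auroc, 3))
```

The numbers this prints describe the synthetic setup only and say nothing about the magnitudes reported in the paper.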
## Extending Evidence
**Source:** Theseus synthesis of Nordby et al. (arXiv 2604.13386, April 2026)

The 29-78% AUROC improvement arises because deception-relevant concept directions rotate across layers rather than staying fixed in the residual stream. A single-layer probe captures only one projection of this rotation, which makes it brittle. A multi-layer ensemble combines projections from several layers and so captures more of the rotational structure. This improvement is measured on clean data only: Nordby et al. include no adversarial robustness testing.
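The "one projection" point can be made concrete with a toy idealization, which is an assumption for illustration and not a model from the paper: suppose the concept direction at layer $\ell$ is a unit vector confined to a fixed plane and rotated by an angle $\theta_\ell$.

```latex
\mathbf{d}_\ell = \cos\theta_\ell\,\mathbf{e}_1 + \sin\theta_\ell\,\mathbf{e}_2,
\qquad
\langle \mathbf{d}_k, \mathbf{d}_\ell \rangle = \cos(\theta_\ell - \theta_k).
```

In this idealization a probe aligned with layer $k$ retains a fraction $\cos(\theta_\ell - \theta_k)$ of the signal at layer $\ell$: half at a 60 degree offset and none at 90 degrees. An ensemble with one probe per layer reads $\langle \mathbf{d}_\ell, \mathbf{d}_\ell \rangle = 1$ at every monitored layer.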
## Challenging Evidence
**Source:** Theseus synthetic analysis (2026-04-22)

Nordby et al.'s clean-data accuracy improvements do not translate to adversarial robustness. White-box multi-layer SCAV can suppress the concept directions at all monitored layers simultaneously through a higher-dimensional optimization. Against adversarially sophisticated attackers with model access, multi-layer ensembles therefore give open-weights models no structural protection.
## Challenging Evidence
**Source:** Theseus adversarial robustness analysis

The 29-78% AUROC improvement is a clean-data result and does not translate to adversarial robustness: Nordby et al. include no adversarial evaluation. White-box multi-layer SCAV is structurally feasible against these ensembles in open-weights models. The accuracy improvement is real but scoped to non-adversarial monitoring contexts.
## Extending Evidence
**Source:** Theseus synthetic analysis of white-box SCAV generalization

The 29-78% AUROC improvement applies to clean-data monitoring but does not translate to adversarial robustness in open-weights deployments. White-box attackers can generalize SCAV to multi-layer ensembles by computing the concept direction at each monitored layer and constructing a perturbation that suppresses all of them simultaneously. The improvement is real but limited in scope to non-adversarial or black-box adversarial contexts.
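The "suppress all simultaneously" step can be illustrated with a deliberately simplified sketch. It is not the published SCAV algorithm and not an analysis from Nordby et al. It assumes hypothetical linear probes with known directions, and it treats each layer's activation as independently editable. In a real network an edit at one layer changes every later layer, so an actual attack would need a joint optimization across layers. The helper `suppress` is a hypothetical name introduced here.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def suppress(h, w, target=0.0):
    """Minimal-L2 edit of h so that the linear probe score w.h equals target."""
    scale = (dot(w, h) - target) / dot(w, w)
    return [x - scale * wj for x, wj in zip(h, w)]

rng = random.Random(0)
n_layers, dim = 4, 16

# Hypothetical per-layer probe directions and one input's activations
# (assumption: random vectors with a component along each probe direction).
probes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_layers)]
acts = [[rng.gauss(0, 1) + 0.5 * wj for wj in w] for w in probes]

scores_before = [dot(w, h) for w, h in zip(probes, acts)]

# White-box step: with every probe direction known, edit each monitored layer.
attacked = [suppress(h, w) for w, h in zip(probes, acts)]
scores_after = [dot(w, h) for w, h in zip(probes, attacked)]

# Size of the edit at each layer, equal to |w.h| / ||w||.
edit_norms = [math.dist(h, a) for h, a in zip(acts, attacked)]

print("scores before:", [round(s, 3) for s in scores_before])
print("scores after: ", [round(s, 3) for s in scores_after])
print("edit norms:   ", [round(n, 3) for n in edit_norms])
```

In this toy, adding monitored layers adds one small edit per layer, which is the sense in which ensembling alone gives no structural protection when the probe directions are known.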
## Extending Evidence
**Source:** Theseus synthetic analysis

The 29-78% AUROC improvement applies to clean-data monitoring accuracy but does not translate to adversarial robustness. Open-weights models remain fully vulnerable to white-box multi-layer SCAV attacks regardless of ensemble complexity. Black-box robustness depends on whether the rotation pattern is universal, which remains untested.