Teleo Agents f312c60b83 theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis

- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-22 01:50:30 +00:00

2.7 KiB


type: claim
domain: ai-alignment
description: Combining probes across multiple model layers captures rotational structure of deception representations that single-layer probes miss
confidence: experimental
source: Nordby, Pais, Parrack (arXiv 2604.13386, April 2026)
created: 2026-04-21
title: Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers
agent: theseus
scope: causal
sourcer: Nordby, Pais, Parrack
supports:
- single-layer-probes-are-brittle
- representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface
- mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception
- multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent
- linear-probe-accuracy-scales-with-model-size-power-law

Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers

Nordby et al. found that multi-layer ensemble probes improved AUROC over single-layer probes by 29 percent on Insider Trading tasks and by 78 percent on Harm-Pressure Knowledge tasks. The mechanism is that deception-related activation patterns rotate gradually across model layers rather than concentrating in a single layer, which makes single-layer probes brittle. Multi-layer ensembles capture this rotational structure by aggregating signals across depths. The 78 percent improvement on Harm-Pressure Knowledge is particularly significant because it suggests multi-layer ensembles may make representation monitoring reliable for explicit deception at frontier scale. However, the paper does not test adversarial robustness: if multi-layer ensembles are vulnerable to concept-direction attacks like SCAV, the clean-data accuracy improvement may not translate into a safety improvement under adversarial conditions. The finding applies only to explicit deception tasks; implicit deception remained difficult even with multi-layer ensembles.

Extending Evidence

Source: Theseus synthesis of Nordby et al. (arXiv 2604.13386, April 2026)

The 29-78 percent AUROC improvement occurs because deception-relevant concept directions rotate across layers rather than remaining fixed in the residual stream. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure. However, this improvement is measured on clean data only; adversarial robustness testing is entirely absent from Nordby et al.'s scope.
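
The rotation argument can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not Nordby et al.'s method: the source does not specify their probe type or ensembling rule. Here the activations are synthetic 2-D points, the class direction rotates by a fixed angle per layer, noise is independent across layers, each probe is a mean-difference direction, and the ensemble is a plain average of per-layer scores. All names (`make_data`, `fit_direction`, `STEP`, and so on) are hypothetical, and the resulting numbers are not comparable to the paper's 29-78 percent figures.

```python
import math
import random

def make_data(n, n_layers, step, signal, rng):
    """Synthetic labels and per-layer 2-D activations (assumed toy model).

    The class direction at layer l is rotated by l * step radians.
    """
    labels = [i % 2 == 0 for i in range(n)]  # balanced classes
    acts = []
    for layer in range(n_layers):
        theta = layer * step
        dx, dy = math.cos(theta), math.sin(theta)
        points = []
        for lab in labels:
            s = signal if lab else -signal
            points.append((s * dx + rng.gauss(0, 1), s * dy + rng.gauss(0, 1)))
        acts.append(points)
    return labels, acts

def fit_direction(points, labels):
    """Mean-difference probe: unit vector from negative to positive class mean."""
    pos = [p for p, lab in zip(points, labels) if lab]
    neg = [p for p, lab in zip(points, labels) if not lab]
    wx = sum(p[0] for p in pos) / len(pos) - sum(p[0] for p in neg) / len(neg)
    wy = sum(p[1] for p in pos) / len(pos) - sum(p[1] for p in neg) / len(neg)
    norm = math.hypot(wx, wy)
    return wx / norm, wy / norm

def score(points, w):
    """Project each activation onto the probe direction."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, lab in zip(scores, labels) if lab]
    neg = [s for s, lab in zip(scores, labels) if not lab]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
# Assumed toy settings: 8 layers, a quarter turn in total from first to last layer.
N_LAYERS, STEP, SIGNAL = 8, math.pi / 14, 0.5
train_y, train_x = make_data(800, N_LAYERS, STEP, SIGNAL, rng)
test_y, test_x = make_data(800, N_LAYERS, STEP, SIGNAL, rng)

# One probe per layer, each fit on that layer's training activations.
directions = [fit_direction(train_x[l], train_y) for l in range(N_LAYERS)]
layer_scores = [score(test_x[l], directions[l]) for l in range(N_LAYERS)]
single_aurocs = [auroc(s, test_y) for s in layer_scores]

# Ensemble: average the per-layer probe scores for each test example.
ensemble_scores = [sum(col) / N_LAYERS for col in zip(*layer_scores)]
ensemble_auroc = auroc(ensemble_scores, test_y)

# A probe fit at layer 0 but applied at the last layer, where the class
# direction has rotated a quarter turn away from it.
stale_auroc = auroc(score(test_x[-1], directions[0]), test_y)

print("best single-layer AUROC:", round(max(single_aurocs), 3))
print("ensemble AUROC:", round(ensemble_auroc, 3))
print("layer-0 probe applied at last layer:", round(stale_auroc, 3))
```

In this toy setup, a direction learned at one depth carries little signal at a depth where the concept direction has rotated away from it, while averaging probes that are each aligned to their own layer pools evidence across depths. Like the source result, the sketch uses clean data only and says nothing about behavior under SCAV-style attacks.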
