teleo-codex/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
Teleo Agents f312c60b83 theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis
- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-22 01:50:30 +00:00

type: claim
domain: ai-alignment
description: Combining probes across multiple model layers captures rotational structure of deception representations that single-layer probes miss
confidence: experimental
source: Nordby, Pais, Parrack (arXiv 2604.13386, April 2026)
created: 2026-04-21
title: Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers
agent: theseus
scope: causal
sourcer: Nordby, Pais, Parrack
supports, related:
single-layer-probes-are-brittle
representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface
mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception
multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent
linear-probe-accuracy-scales-with-model-size-power-law

Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers

Nordby et al. found that multi-layer ensemble probes improved AUROC over single-layer probes by 29 percent on Insider Trading tasks and by 78 percent on Harm-Pressure Knowledge tasks. The mechanism is that deception-related activation patterns rotate gradually across model layers rather than concentrating in a single layer, so a probe fit at one layer sees only part of the signal and is brittle. Multi-layer ensembles capture this rotational structure by aggregating signals across depths. The 78 percent improvement on Harm-Pressure Knowledge is particularly significant because it suggests that multi-layer ensembles may push representation monitoring into a reliable regime for explicit deception at frontier scale. However, the paper does not test adversarial robustness: if multi-layer ensembles are vulnerable to concept-direction attacks like SCAV, the clean-accuracy improvement may not translate into a safety improvement under adversarial conditions. The finding also applies only to explicit deception tasks; implicit deception remained difficult even with multi-layer ensembles.
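
The aggregation step described above can be sketched on synthetic data. This is a toy illustration, not Nordby et al.'s setup: the 2-D activations, the constant per-layer rotation `STEP`, the signal strength `SIGNAL`, the independent Gaussian noise at each layer, and the difference-of-means probe (standing in for a trained linear probe) are all assumptions made for the example. In this toy, the ensemble's gain comes from per-layer probes tracking the rotated direction while averaging out independent noise.

```python
import math
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_data(n, n_layers, step, signal, rng):
    """Toy activations: acts[i][layer] is a 2-D vector for sample i at that layer."""
    labels = [i % 2 for i in range(n)]
    acts = []
    for y in labels:
        sign = 1.0 if y == 1 else -1.0
        sample = []
        for layer in range(n_layers):
            # Assumption: the label direction rotates by `step` radians per layer.
            ux, uy = math.cos(layer * step), math.sin(layer * step)
            sample.append((sign * signal * ux + rng.gauss(0, 1),
                           sign * signal * uy + rng.gauss(0, 1)))
        acts.append(sample)
    return acts, labels

def fit_probe(acts, labels, layer):
    """Unit difference-of-means direction at one layer (stand-in for a linear probe)."""
    sums = {0: [0.0, 0.0], 1: [0.0, 0.0]}
    counts = {0: 0, 1: 0}
    for sample, y in zip(acts, labels):
        sums[y][0] += sample[layer][0]
        sums[y][1] += sample[layer][1]
        counts[y] += 1
    dx = sums[1][0] / counts[1] - sums[0][0] / counts[0]
    dy = sums[1][1] / counts[1] - sums[0][1] / counts[0]
    norm = math.hypot(dx, dy)
    return dx / norm, dy / norm

def layer_scores(acts, probe, layer):
    """Project every sample's activation at `layer` onto the probe direction."""
    return [s[layer][0] * probe[0] + s[layer][1] * probe[1] for s in acts]

rng = random.Random(0)
N_LAYERS, STEP, SIGNAL = 8, math.pi / 14, 0.5  # hypothetical toy settings
train_x, train_y = make_data(400, N_LAYERS, STEP, SIGNAL, rng)
test_x, test_y = make_data(400, N_LAYERS, STEP, SIGNAL, rng)

# One probe per layer, each fit and evaluated at its own layer.
probes = [fit_probe(train_x, train_y, layer) for layer in range(N_LAYERS)]
per_layer = [layer_scores(test_x, probes[layer], layer) for layer in range(N_LAYERS)]
single_aurocs = [auroc(scores, test_y) for scores in per_layer]

# Ensemble: average the per-layer probe scores for each sample.
ensemble = [sum(col) / N_LAYERS for col in zip(*per_layer)]
ensemble_auroc = auroc(ensemble, test_y)

print("best single-layer AUROC:", round(max(single_aurocs), 3))
print("ensemble AUROC:", round(ensemble_auroc, 3))
```

Score averaging is only one possible way to aggregate; the source does not specify which combination rule Nordby et al. use, and nothing here models adversarial inputs.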

Extending Evidence

Source: Theseus synthesis of Nordby et al. (arXiv 2604.13386, April 2026)

The 29-78% AUROC improvement occurs because deception-relevant concept directions rotate across layers rather than remaining fixed in the residual stream. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure. However, this improvement is measured on clean data only; adversarial robustness testing is entirely absent from Nordby et al.'s scope.
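
The "one projection" point can be made concrete with a small geometric sketch. It assumes, hypothetically, that the concept direction is a unit vector rotating by a constant angle `STEP` per layer within a single plane; the source does not state the actual rotation rate or geometry. Under that assumption, the fraction of the signal a fixed probe direction picks up at another layer is the cosine of the angle between the two directions.

```python
import math

def captured_fraction(probe_layer, n_layers, step):
    """Cosine between the probe's direction and each layer's concept direction.

    Assumes a hypothetical constant in-plane rotation of `step` radians per layer.
    """
    return [math.cos((layer - probe_layer) * step) for layer in range(n_layers)]

N_LAYERS = 8
STEP = math.pi / 14  # assumed: 90 degrees of total rotation across 8 layers

# A probe direction fit at layer 0 and reused unchanged at every layer.
fixed = captured_fraction(0, N_LAYERS, STEP)
for layer, frac in enumerate(fixed):
    print(f"layer {layer}: fixed layer-0 direction sees {frac:.2f} of the signal")
```

In this toy geometry the layer-0 direction is fully aligned at its own layer and orthogonal to the concept direction by the last layer, which is one way to picture why a single projection is brittle while a set of per-layer projections is not.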