Teleo Agents 05c39564b4 theseus: extract claims from 2026-04-00-nordby-linear-probe-accuracy-scales-model-size-multi-layer

- Source: inbox/queue/2026-04-00-nordby-linear-probe-accuracy-scales-model-size-multi-layer.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-21 00:30:21 +00:00

1.9 KiB

Raw Blame History

type

domain

description

confidence

source

created

title

agent

scope

sourcer

supports

claim

ai-alignment

Combining probes across multiple model layers captures rotational structure of deception representations that single-layer probes miss

experimental

Nordby, Pais, Parrack (arXiv 2604.13386, April 2026)

2026-04-21

Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers

theseus

causal

Nordby, Pais, Parrack

single-layer-probes-are-brittle

representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface

mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception

Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers

Nordby et al. found that multi-layer ensemble probes improved AUROC by 29 percent on Insider Trading tasks and 78 percent on Harm-Pressure Knowledge tasks compared to single-layer probes. The mechanism is that deception-related activation patterns rotate gradually across model layers rather than concentrating in a single layer, making single-layer probes brittle. Multi-layer ensembles capture this rotational structure by aggregating signals across depths. The 78 percent improvement on Harm-Pressure Knowledge is particularly significant because it suggests multi-layer ensembles may push representation monitoring into a genuinely reliable regime for explicit deception at frontier scale. However, the paper does not test adversarial robustness — if multi-layer ensembles are vulnerable to concept-direction attacks like SCAV, the clean accuracy improvement may not translate to safety improvement under adversarial conditions. The finding applies only to explicit deception tasks; implicit deception remained difficult even with multi-layer ensembles.

1.9 KiB Raw Blame History

Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers

1.9 KiB

Raw Blame History