---
type: source
title: "Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling"
author: "Erik Nordby, Tasha Pais, Aviel Parrack"
url: https://arxiv.org/html/2604.13386
date: 2026-04-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
priority: high
tags: [linear-probes, representation-monitoring, deception-detection, scaling-laws, interpretability, verification]
---

## Content

Nordby, Pais, and Parrack test linear probe accuracy for deception detection across 12 models ranging from 0.5B to 176B parameters (April 2026, very recent).

**Key findings:**

- Probe accuracy improves by roughly 5% AUROC per 10x increase in parameters (R = 0.81)
- Multi-layer ensemble probes improve AUROC by **29%** (Insider Trading task) to **78%** (Harm-Pressure Knowledge task) over single-layer probes
- Deception directions rotate gradually across layers: single-layer probes are brittle, while multi-layer ensembles capture the rotational structure
- Explicit deception (role-playing, direct lying instructions) achieves AUROC >0.95 in larger models
- Implicit deception through harmful request avoidance remains hard across ALL methods; this is the persistent gap

**Critical limitation noted by authors:** Probes may detect "elicitation artifacts rather than underlying deceptive reasoning"; there is a domain gap between training on simple contrastive prompts and evaluating on complex strategic scenarios. The correlation between probe accuracy and model size could reflect improved ability to represent information linearly, not improved ability to detect actual strategic deception.

## Agent Notes

**Why this matters:** This is a direct challenge to B4 ("verification degrades faster than capability grows"). If representation monitoring scales with model size at roughly 5% AUROC per 10x parameters, then verification IS improving as capabilities grow.
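To make the ensemble mechanics concrete before the commentary below, here is a minimal numpy sketch of a multi-layer probe of the kind the key findings describe: one linear probe per layer, with scores averaged across layers. Everything in it (the synthetic activations, the rotation model, the least-squares fit, the mean-score combination rule) is an assumption for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-layer activations (n_examples x n_layers x d_model).
# The "deception direction" rotates slightly from layer to layer, mimicking the
# rotational structure the paper reports.
n, n_layers, d = 400, 8, 64
y = rng.integers(0, 2, size=n) * 2 - 1              # +1 = deceptive, -1 = honest
acts = rng.normal(size=(n, n_layers, d))
base = rng.normal(size=d)
for layer in range(n_layers):
    direction = base + 0.3 * rng.normal(size=d)     # gradual rotation across layers
    direction /= np.linalg.norm(direction)
    acts[:, layer] += np.outer(y.astype(float), direction)

# One least-squares linear probe per layer.
probes = [np.linalg.lstsq(acts[:, layer], y.astype(float), rcond=None)[0]
          for layer in range(n_layers)]

def probe_score(x, layers):
    """Mean probe score over the chosen layers; the sign is the prediction."""
    return np.mean([x[:, layer] @ probes[layer] for layer in layers], axis=0)

single_acc = np.mean(np.sign(probe_score(acts, [n_layers // 2])) == y)
ensemble_acc = np.mean(np.sign(probe_score(acts, range(n_layers))) == y)
print(f"single-layer accuracy: {single_acc:.2f}")
print(f"ensemble accuracy:     {ensemble_acc:.2f}")
```

On this toy data the ensemble wins because per-layer noise averages out while the (slowly rotating) signal direction does not; the paper's 29-78% AUROC gains are on real tasks, and both probes here see their own training set, so the numbers only illustrate the mechanics.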
The multi-layer ensemble finding (29-78% improvement) suggests current single-layer probe deployments are substantially suboptimal: the monitoring precision hierarchy can improve without waiting for new techniques.

**What surprised me:** The 78% AUROC improvement on Harm-Pressure Knowledge tasks with multi-layer ensembles is larger than I expected. If it holds up, multi-layer ensembles may push representation monitoring into a genuinely reliable safety regime for *explicit* deception at frontier scale. The catch is the "implicit deception" gap, which is exactly the failure mode that matters most for scheming.

**What I expected but didn't find:** Any comparison of multi-layer ensemble probes against SCAV-style attacks. The paper doesn't test the adversarial robustness of multi-layer ensembles. If they are also vulnerable to concept-direction attacks (as single-layer probes are via SCAV), the improvement in clean accuracy may not translate into a safety improvement. This gap is the key follow-up question for the Beaglehole × SCAV divergence.

**KB connections:**

- Extends: Goldowsky-Dill et al. (arXiv 2502.03407) on detecting strategic deception with linear probes
- Potentially challenges: B4 belief ("verification degrades faster than capability grows"), but only for explicit deception, not implicit
- Connected to: the monitoring precision hierarchy (SAE → linear probes → trajectory geometry)
- Connected to: the Beaglehole × SCAV divergence; if multi-layer ensembles are adversarially robust, this resolves the divergence toward representation monitoring winning

**Extraction hints:** Two extractable claims: (1) linear probe accuracy scales with model size following a power law; (2) multi-layer ensemble probes substantially outperform single-layer probes. Both need careful scoping: they apply to *explicit deception* detection; implicit deception detection remains unsolved.
The scaling claim has direct ERI implications: if verification scales with capabilities, the capability-verification gap may be more contingent than structural.

**Context:** Three-author paper from April 2026; it is fresh and likely pre-peer-review. The authors appear independent (not affiliated with a major lab). Treat as experimental confidence until venue-published.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: Goldowsky-Dill et al. (arXiv 2502.03407), "Detecting Strategic Deception Using Linear Probes"

WHY ARCHIVED: First scaling law for representation-monitoring accuracy (~5% AUROC per 10x parameters), plus a substantial multi-layer ensemble improvement (29-78%). Directly relevant to whether B4's capability-verification gap is structural or contingent.

EXTRACTION HINT: Extract two separate claims: one for the scaling law, one for the multi-layer ensemble improvement. Be explicit about the scope limitation: both apply to explicit deception tasks; the implicit deception gap is confirmed. Flag the adversarial robustness question (untested against SCAV) as a limitation.
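One extra handle for the extractor on the scaling claim: "~5% AUROC per 10x parameters" means the trend is linear in log10(parameters). A throwaway sketch of that extrapolation follows; only the slope comes from the paper, while the 0.80-AUROC-at-1B anchor is an invented reference point for illustration.

```python
import math

def auroc_trend(n_params, auroc_ref=0.80, n_ref=1e9, slope_per_decade=0.05):
    """Linear-in-log(params) extrapolation of the reported trend.

    Only the ~0.05 AUROC-per-decade slope comes from the paper; the
    0.80-at-1B anchor (auroc_ref, n_ref) is invented for illustration.
    """
    est = auroc_ref + slope_per_decade * math.log10(n_params / n_ref)
    return min(est, 1.0)  # AUROC is bounded above by 1.0

# Span roughly the paper's 0.5B-176B model range.
for n in (5e8, 1e9, 1e10, 1e11, 1.76e11):
    print(f"{n:.2e} params -> estimated AUROC {auroc_trend(n):.3f}")
```

The hard 1.0 ceiling also shows why any such trend must flatten eventually, which is one more reason to treat the "scaling law" as a local fit over the tested range rather than an indefinite extrapolation.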