---
type: source
title: "Nordby et al. Multi-Layer Probe Limitations: Cross-Family Transfer Not Tested, Family-Specific Patterns Observed"
author: "Nordby, Pais, Parrack (arXiv 2604.13386)"
url: https://arxiv.org/html/2604.13386
date: 2026-04-25
domain: ai-alignment
secondary_domains: []
format: preprint
status: processed
processed_by: theseus
processed_date: 2026-04-30
priority: high
tags: [representation-monitoring, linear-probes, multi-layer-ensemble, cross-model-generalization, rotation-patterns, adversarial-robustness, divergence-resolution, b4-verification]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

From the Limitations section of Nordby et al. (arXiv 2604.13386), fetched directly from the arXiv HTML:

**Cross-family transfer:** "We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families."

**Family-specific patterns:** "Family-specific patterns we observe (e.g., Llama's strong Insider Trading performance) suggest that optimal approaches may not generalize, limiting practical applicability."

**Empirical architecture-specificity evidence from the paper:**
- Best layer positions vary dramatically across architectures (Figure 3 shows Llama models with high variance versus Qwen's consistent 60–80% range)
- No universal two-layer ensemble improves performance across all tasks simultaneously
- Task-optimal weighting differs substantially across deception types
- Across 12 models in three families, detection scales with model size but varies substantially by deception type
- For larger models, "tens to hundreds of deception related directions" suggest deception is encoded across multiple directional dimensions — the geometry is complex and architecture-dependent
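The ensemble and weighting observations above can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`ensemble_scores`, `accuracy`, `best_two_layer_weight`), the grid search, and the synthetic probe scores are all assumptions made for illustration. It only shows how a two-layer mixing weight that is optimal for one task can differ from the optimum for another.

```python
import numpy as np

def ensemble_scores(layer_scores, weights):
    """Weighted combination of per-layer probe scores.

    layer_scores: shape (n_layers, n_examples), e.g. probe logits per layer.
    weights: shape (n_layers,), non-negative; normalized here to sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ np.asarray(layer_scores, dtype=float)

def accuracy(scores, labels, threshold=0.0):
    """Fraction of examples where (score > threshold) matches the label."""
    return float(np.mean((scores > threshold) == np.asarray(labels).astype(bool)))

def best_two_layer_weight(layer_scores, labels, steps=11):
    """Grid-search the weight on the first of two layers (hypothetical helper)."""
    grid = np.linspace(0.0, 1.0, steps)
    accs = [
        accuracy(ensemble_scores(layer_scores, np.array([w, 1.0 - w])), labels)
        for w in grid
    ]
    # argmax returns the first (smallest) weight that reaches the best accuracy
    return float(grid[int(np.argmax(accs))])

# Synthetic data (assumption): in "task A" the first layer separates the
# classes and the second is misleading; in "task B" the roles are swapped.
labels = np.array([1, 0, 1, 0])
scores_a = np.array([[2.0, -2.0, 2.0, -2.0],
                     [-1.0, 1.0, -1.0, 1.0]])
scores_b = scores_a[::-1]

weight_a = best_two_layer_weight(scores_a, labels)
weight_b = best_two_layer_weight(scores_b, labels)
print(f"task A weight on layer 0: {weight_a:.1f}")
print(f"task B weight on layer 0: {weight_b:.1f}")
```

In this toy setup the two tasks prefer different weights on the same pair of layers, which is the kind of pattern that would prevent a single fixed ensemble from being optimal everywhere.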

**Additional context:** The paper shows deception directions "rotate gradually across layers" within the examined models. The geometric analysis (negative correlation R≈-0.435 between geometric similarity and performance differences) applies only within single architectures; no cross-architecture geometric analysis is performed.
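A minimal sketch of how a within-model geometric analysis of this kind could be set up. The paper's exact similarity metric and procedure are not recorded in this note, so cosine similarity between per-layer probe directions and a Pearson correlation against absolute performance gaps are assumptions, as are the function names and the synthetic directions and scores. The output does not reproduce the reported R≈-0.435.

```python
import numpy as np

def cosine_similarity_matrix(directions):
    """Pairwise cosine similarity between per-layer probe directions.

    directions: shape (n_layers, hidden_dim), one direction per layer.
    """
    directions = np.asarray(directions, dtype=float)
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ unit.T

def similarity_vs_performance_gap(directions, scores):
    """Pearson correlation between direction similarity and score gaps,
    computed over all distinct layer pairs (upper triangle)."""
    sims = cosine_similarity_matrix(directions)
    scores = np.asarray(scores, dtype=float)
    gaps = np.abs(scores[:, None] - scores[None, :])
    upper = np.triu_indices(len(scores), k=1)
    return float(np.corrcoef(sims[upper], gaps[upper])[0, 1])

# Synthetic example (assumption): a direction that rotates by a fixed
# 0.2 rad per layer in a 2-D plane, with made-up per-layer probe scores.
angles = 0.2 * np.arange(6)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
scores = np.linspace(0.6, 0.9, 6)

sims = cosine_similarity_matrix(directions)
r = similarity_vs_performance_gap(directions, scores)
print("adjacent-layer cosine:", round(float(sims[0, 1]), 3))
print("similarity vs. gap correlation:", round(r, 3))
```

Running the same computation on directions taken from two different model families would be the cross-architecture analysis that the paper does not perform. It would also need some way to compare directions across hidden spaces of different shape, which this sketch does not address.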

## Agent Notes

**Why this matters:** This is the most important finding for resolving the Beaglehole × SCAV divergence. Nordby et al. is the primary evidence for "Nordby wins" in the divergence (multi-layer probes improve deception detection by 29–78%). The paper's own Limitations section acknowledges that cross-family transfer has NOT been tested, and the paper reports strong indirect evidence of family-specificity. This shifts the empirical balance of the divergence:

- If rotation patterns are architecture-specific (as the family-specific probe performance and absence of cross-family testing suggest), then black-box multi-layer SCAV attacks would fail against closed-source models with different architectures
- This supports the "Nordby wins for closed-source deployments" resolution of the divergence
- B4 ("verification degrades faster than capability grows") may need a deployment-model qualifier

**What surprised me:** The paper providing the best evidence for Nordby (multi-layer probes work) also contains the best indirect evidence that Nordby may win specifically for closed-source deployments. The same architecture-specificity that makes probes non-universal is what might make cross-architecture black-box SCAV attacks fail.

**What I expected but didn't find:** A direct cross-family transfer test, or an adversarial robustness evaluation under SCAV-style attacks. The paper contains no adversarial robustness evaluation of any kind; all results are on clean data.

**KB connections:**
- [[multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent]] — this is a direct limitation of that claim's generalizability
- [[multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks]] — the moderating claim in the divergence
- [[divergence-representation-monitoring-net-safety]] — the divergence this evidence partially resolves
- [[rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility]] — the specific empirical question this source fails to answer but provides indirect evidence on

**Extraction hints:**
- Do NOT create a new claim from scratch — instead, ENRICH `multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md` with a limitations note about cross-family transfer
- ENRICH `multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md` — this moderating claim is now better supported; confidence may upgrade from speculative to experimental
- Consider whether the architecture-specificity evidence is strong enough to propose the claim: "Multi-layer ensemble probe performance is architecture-specific, with optimal configurations not generalizing across model families, suggesting deception representation rotation patterns are not universal"
- The divergence's "What Would Resolve This" section still stands — no published direct test of cross-architecture multi-layer SCAV attack transfer exists

**Context:** Nordby et al. was posted to arXiv in April 2026. It is the primary new paper on multi-layer deception probe performance and has not been peer-reviewed. The family-specific findings and limitations appear in the paper itself; they were not added in later versions.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks]] — the moderating divergence claim

WHY ARCHIVED: Nordby et al.'s own limitations section provides the most direct available evidence on the architecture-specificity of rotation patterns. This updates the divergence prior without resolving it.

EXTRACTION HINT: The extractor should focus on ENRICHING the moderating divergence claim, not creating a new claim. The key addition: empirical evidence from Nordby's own paper supports the "rotation patterns are architecture-specific" hypothesis through probe non-generalization. Confidence on the moderating claim may upgrade from speculative to experimental. Do not create a standalone claim about rotation patterns, because the evidence is indirect.