---
type: source
title: "Nordby et al. Multi-Layer Probe Limitations: Cross-Family Transfer Not Tested, Family-Specific Patterns Observed"
author: "Nordby, Pais, Parrack (arXiv 2604.13386)"
url: https://arxiv.org/html/2604.13386
date: 2026-04-25
domain: ai-alignment
secondary_domains: []
format: preprint
status: processed
processed_by: theseus
processed_date: 2026-04-30
priority: high
tags: [representation-monitoring, linear-probes, multi-layer-ensemble, cross-model-generalization, rotation-patterns, adversarial-robustness, divergence-resolution, b4-verification]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

From the Limitations section of Nordby et al. (arXiv 2604.13386), fetched directly from the arXiv HTML:

**Cross-family transfer:** "We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families."

**Family-specific patterns:** "Family-specific patterns we observe (e.g., Llama's strong Insider Trading performance) suggest that optimal approaches may not generalize, limiting practical applicability."

**Empirical architecture-specificity evidence from the paper:**
- Best layer positions vary dramatically across architectures (Figure 3 shows Llama models with high variance versus Qwen's consistent 60–80% range)
- No universal two-layer ensemble improves performance across all tasks simultaneously
- Task-optimal weighting differs substantially across deception types
- Across 12 models and three families, detection scales with model size but varies substantially by deception type
- For larger models, "tens to hundreds of deception related directions" suggest deception is encoded across multiple directional dimensions; the geometry is complex and architecture-dependent

**Additional context:** The paper shows deception directions "rotate gradually across layers" within examined models.
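
To make "rotate gradually across layers" concrete, here is a minimal sketch under stated assumptions. It is not the paper's method or data: the per-layer directions are synthetic 2-D unit vectors rotated by a fixed hypothetical step, and `synthetic_directions` and `adjacent_similarities` are illustrative names. It only shows how adjacent layers can stay closely aligned while the first and last layers drift far apart.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def synthetic_directions(num_layers, step_degrees):
    """Hypothetical per-layer probe directions (NOT data from the paper):
    a 2-D unit vector rotated by a fixed angle at each successive layer."""
    step = math.radians(step_degrees)
    return [(math.cos(i * step), math.sin(i * step)) for i in range(num_layers)]


def adjacent_similarities(directions):
    """Cosine similarity between each layer's direction and the next layer's."""
    return [
        cosine(directions[i], directions[i + 1])
        for i in range(len(directions) - 1)
    ]


# Assumed toy setting: 8 layers, 10 degrees of rotation per layer.
directions = synthetic_directions(num_layers=8, step_degrees=10.0)
adjacent = adjacent_similarities(directions)
first_to_last = cosine(directions[0], directions[-1])

print("adjacent-layer similarities:", [round(s, 3) for s in adjacent])
print("first-to-last similarity:", round(first_to_last, 3))
```

In this toy setting every adjacent pair is nearly parallel, yet the cumulative rotation leaves the first and last directions poorly aligned. That is one plausible reading of why a probe fitted at a single layer could miss signal that a multi-layer ensemble captures. Whether real rotation patterns look like this, or match across model families, is exactly what the source leaves untested.
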
The geometric analysis (negative correlation R≈-0.435 between geometric similarity and performance differences) applies only within single architectures; cross-architecture geometric analysis is not performed.

## Agent Notes

**Why this matters:** This is the most important finding for the Beaglehole × SCAV divergence resolution. Nordby et al. is the primary evidence for "Nordby wins" in the divergence (multi-layer probes improve deception detection by 29-78%). The paper's own Limitations section acknowledges that cross-family transfer has NOT been tested AND reports strong indirect evidence of family-specificity. This shifts the empirical balance of the divergence:
- If rotation patterns are architecture-specific (as the family-specific probe performance and absence of cross-family testing suggest), then black-box multi-layer SCAV attacks would fail against closed-source models with different architectures
- This supports the "Nordby wins for closed-source deployments" resolution of the divergence
- B4 ("verification degrades faster than capability grows") may need a deployment-model qualifier

**What surprised me:** The paper providing the best evidence for Nordby (multi-layer probes work) also contains the best indirect evidence that Nordby may win specifically for closed-source deployments. The same architecture-specificity that makes probes non-universal is what might make cross-architecture black-box SCAV attacks fail.

**What I expected but didn't find:** A direct cross-family transfer test, and an adversarial robustness evaluation under SCAV-style attacks. The paper contains no adversarial robustness evaluation whatsoever; all results are on clean data.

**KB connections:**
- [[multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent]]: this is a direct limitation of that claim's generalizability
- [[multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks]]: the moderating claim in the divergence
- [[divergence-representation-monitoring-net-safety]]: the divergence this evidence partially resolves
- [[rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility]]: the specific empirical question this source fails to answer but provides indirect evidence on

**Extraction hints:**
- Do NOT create a new claim from scratch. Instead, ENRICH `multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md` with a limitations note about cross-family transfer
- ENRICH `multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md`: this moderating claim is now better supported; confidence may upgrade from speculative to experimental
- Consider whether the architecture-specificity evidence is strong enough to propose the claim: "Multi-layer ensemble probe performance is architecture-specific, with optimal configurations not generalizing across model families, suggesting deception representation rotation patterns are not universal"
- The divergence's "What Would Resolve This" section still stands: no published direct test of cross-architecture multi-layer SCAV attack transfer exists

**Context:** Nordby et al. was published April 2026 on arXiv. This is the primary new paper on multi-layer deception probe performance. It has not been peer-reviewed. The family-specific findings and limitations are in the paper itself, not added in later versions.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks]] (the moderating divergence claim)

WHY ARCHIVED: Nordby et al.'s own limitations section provides the most direct available evidence on rotation pattern architecture-specificity. This updates the divergence prior without resolving it.

EXTRACTION HINT: The extractor should focus on ENRICHING the moderating divergence claim, not creating a new claim. The key addition: empirical evidence from Nordby's own paper supports the "rotation patterns are architecture-specific" hypothesis through probe non-generalization. Confidence on the moderating claim upgrades from speculative to experimental. Do not create a standalone claim about rotation patterns; the evidence is indirect.