---
type: source
title: "Nordby et al. Multi-Layer Probe Limitations: Cross-Family Transfer Not Tested, Family-Specific Patterns Observed"
author: "Nordby, Pais, Parrack (arXiv 2604.13386)"
url: https://arxiv.org/html/2604.13386
date: 2026-04-25
domain: ai-alignment
secondary_domains: []
format: preprint
status: unprocessed
priority: high
tags: [representation-monitoring, linear-probes, multi-layer-ensemble, cross-model-generalization, rotation-patterns, adversarial-robustness, divergence-resolution, b4-verification]
---

## Content

From the Limitations section of Nordby et al. (arXiv 2604.13386), fetched directly from the arXiv HTML:

**Cross-family transfer:** "We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families."

**Family-specific patterns:** "Family-specific patterns we observe (e.g., Llama's strong Insider Trading performance) suggest that optimal approaches may not generalize, limiting practical applicability."

**Empirical architecture-specificity evidence from the paper:**

- Best layer positions vary dramatically across architectures (Figure 3 shows Llama models with high variance versus Qwen's consistent 60–80% range)
- No universal two-layer ensemble improves performance across all tasks simultaneously
- Task-optimal weighting differs substantially across deception types
- Across 12 models and three families, detection scales with model size but varies substantially by deception type
- For larger models, "tens to hundreds of deception related directions" suggest deception is encoded across multiple directional dimensions — the geometry is complex and architecture-dependent

**Additional context:** The paper shows deception directions "rotate gradually across layers" within examined models. The geometric analysis (negative correlation R≈-0.435 between geometric similarity and performance differences) applies only within single architectures — cross-architecture geometric analysis is not performed.
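
To make the within-architecture geometric analysis concrete, here is a minimal sketch of one way such a correlation could be computed. It is not the paper's code. It assumes one probe direction per layer, treats "geometric similarity" as cosine similarity between layer pairs, and treats "performance difference" as the absolute gap in probe accuracy. The function names, the synthetic directions, and the accuracy values are all hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two probe directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def similarity_vs_accuracy_gap(directions, accuracies):
    """Correlate pairwise direction similarity with pairwise accuracy gap.

    directions: one probe direction per layer (assumed representation).
    accuracies: one probe accuracy per layer (assumed metric).
    """
    sims, gaps = [], []
    for i, j in combinations(range(len(directions)), 2):
        sims.append(cosine_similarity(directions[i], directions[j]))
        gaps.append(abs(accuracies[i] - accuracies[j]))
    return pearson_r(sims, gaps)


# Synthetic stand-in for a single model: a direction that rotates gradually
# in a 2-D plane across six layers, with accuracy drifting as it rotates.
angles = [k * (math.pi / 2) / 5 for k in range(6)]
directions = [(math.cos(a), math.sin(a)) for a in angles]
accuracies = [0.60 + 0.05 * k for k in range(6)]

r = similarity_vs_accuracy_gap(directions, accuracies)
print(f"Pearson r on synthetic data: {r:.3f}")
```

On this toy data the correlation is negative by construction, because layers whose directions are closer also have closer accuracies. Running the same computation across architectures would first require a shared coordinate system for the directions, which is the analysis the paper does not perform.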

## Agent Notes

**Why this matters:** This is the most important finding for the Beaglehole × SCAV divergence resolution. Nordby et al. is the primary evidence for "Nordby wins" in the divergence (multi-layer probes improve deception detection 29-78%). The paper's own Limitations section acknowledges that cross-family transfer has NOT been tested AND reports strong indirect evidence of family-specificity. This shifts the empirical balance of the divergence:

- If rotation patterns are architecture-specific (as the family-specific probe performance and absence of cross-family testing suggest), then black-box multi-layer SCAV attacks would likely fail against closed-source models with different architectures
- This supports the "Nordby wins for closed-source deployments" resolution of the divergence
- B4 ("verification degrades faster than capability grows") may need a deployment-model qualifier

**What surprised me:** That the paper providing the best evidence for Nordby (multi-layer probes work) also contains the best indirect evidence that Nordby may win specifically for closed-source deployments — the same architecture-specificity that makes probes non-universal is what might make cross-architecture black-box SCAV attacks fail.

**What I expected but didn't find:** A direct cross-family transfer test, and an adversarial robustness evaluation under SCAV-style attacks. The paper contains no adversarial robustness evaluation whatsoever; all results are on clean data.

**KB connections:**

- [[multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent]] — this is a direct limitation of that claim's generalizability
- [[multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks]] — the moderating claim in the divergence
- [[divergence-representation-monitoring-net-safety]] — the divergence this evidence partially resolves
- [[rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility]] — the specific empirical question this source fails to answer but provides indirect evidence on

**Extraction hints:**

- Do NOT create a new claim from scratch — instead, ENRICH `multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md` with a limitations note about cross-family transfer
- ENRICH `multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md` — this moderating claim is now better supported; confidence may upgrade from speculative to experimental
- Consider whether the architecture-specificity evidence is strong enough to propose the claim: "Multi-layer ensemble probe performance is architecture-specific, with optimal configurations not generalizing across model families, suggesting deception representation rotation patterns are not universal"
- The divergence's "What Would Resolve This" section still stands — no published direct test of cross-architecture multi-layer SCAV attack transfer exists

**Context:** Nordby et al. was posted to arXiv in April 2026. This is the primary new paper on multi-layer deception probe performance. It has not been peer-reviewed. The family-specific findings and limitations are in the paper itself, not added in later versions.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks]] — the moderating divergence claim

WHY ARCHIVED: Nordby et al.'s own limitations section provides the most direct available evidence on rotation pattern architecture-specificity. This updates the divergence prior without resolving it.

EXTRACTION HINT: The extractor should focus on ENRICHING the moderating divergence claim, not creating a new claim. The key addition: empirical evidence from Nordby's own paper supports the "rotation patterns are architecture-specific" hypothesis through probe non-generalization. Confidence on the moderating claim may upgrade from speculative to experimental. Do not create a standalone claim about rotation patterns — the evidence is indirect.