| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | Nordby et al. Multi-Layer Probe Limitations: Cross-Family Transfer Not Tested, Family-Specific Patterns Observed | Nordby, Pais, Parrack (arXiv 2604.13386) | https://arxiv.org/html/2604.13386 | 2026-04-25 | ai-alignment | | preprint | unprocessed | high | |
Content
From the Limitations section of Nordby et al. (arXiv 2604.13386), fetched directly from the arXiv HTML:
Cross-family transfer: "We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families."
Family-specific patterns: "Family-specific patterns we observe (e.g., Llama's strong Insider Trading performance) suggest that optimal approaches may not generalize, limiting practical applicability."
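The question left untested in the first quote can be framed as a transfer gap: pick the configuration that scores best on one family, reuse it on another, and measure how much is lost relative to the target family's own best. The sketch below is illustrative only. The configuration keys (relative layer depths), the score tables, and the function names are hypothetical and do not come from the paper.

```python
from typing import Dict, Tuple

# A hypothetical "configuration": relative depths of the two probed layers.
Config = Tuple[float, float]


def best_config(scores: Dict[Config, float]) -> Config:
    """Return the configuration with the highest score."""
    if not scores:
        raise ValueError("scores must not be empty")
    return max(scores, key=lambda cfg: scores[cfg])


def transfer_gap(source_scores: Dict[Config, float],
                 target_scores: Dict[Config, float]) -> float:
    """Score lost on the target family by reusing the source family's best config.

    A gap of 0.0 means the source-optimal configuration is also optimal on
    the target; a larger gap means the configuration transfers poorly.
    """
    cfg = best_config(source_scores)
    if cfg not in target_scores:
        raise KeyError(f"configuration {cfg} was not evaluated on the target family")
    return max(target_scores.values()) - target_scores[cfg]


# Made-up scores for illustration, NOT results from the paper.
family_a = {(0.4, 0.6): 0.90, (0.6, 0.8): 0.80}
family_b = {(0.4, 0.6): 0.70, (0.6, 0.8): 0.85}
gap = transfer_gap(family_a, family_b)
```

A systematic cross-family test would fill such tables with measured scores for every source and target pair; the note's point is that no such measurement has been published.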
Empirical architecture-specificity evidence from the paper:
- Best layer positions vary dramatically across architectures (Figure 3 shows Llama models with high variance versus Qwen's consistent 60–80% range)
- No universal two-layer ensemble improves performance across all tasks simultaneously
- Task-optimal weighting differs substantially across deception types
- Across 12 models and three families, detection scales with model size but varies substantially by deception type
- For larger models, "tens to hundreds of deception related directions" suggest deception is encoded across multiple directional dimensions — the geometry is complex and architecture-dependent
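To make the ensemble and task-weighting bullets concrete, here is a minimal sketch of a two-layer probe ensemble with per-task weights. It assumes each layer has a linear probe with a logistic output and that the ensemble is a weighted average of layer scores. The note does not say how the paper combines layers, so the combination rule, the layer indices, the weights, and the name `other_task` are all hypothetical.

```python
import math
from typing import Dict, Sequence


def probe_score(activation: Sequence[float],
                direction: Sequence[float],
                bias: float = 0.0) -> float:
    """Logistic score of a linear probe applied to one layer's activation."""
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    logit = sum(a * d for a, d in zip(activation, direction)) + bias
    # Numerically stable sigmoid for large positive or negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def ensemble_score(layer_scores: Dict[int, float],
                   layer_weights: Dict[int, float]) -> float:
    """Weighted average of per-layer probe scores (weights are normalized)."""
    total = sum(layer_weights.values())
    if total <= 0:
        raise ValueError("layer weights must sum to a positive value")
    return sum(layer_scores[layer] * w for layer, w in layer_weights.items()) / total


# Hypothetical per-task weights over two layers (made-up numbers).
TASK_WEIGHTS = {
    "insider_trading": {12: 1.0, 20: 3.0},
    "other_task": {12: 3.0, 20: 1.0},
}

layer_scores = {12: 0.2, 20: 0.8}  # made-up per-layer probe outputs
for task, weights in TASK_WEIGHTS.items():
    print(task, round(ensemble_score(layer_scores, weights), 3))
```

With identical layer scores, the two weightings produce different ensemble outputs, which is why a single weighting is unlikely to be optimal for every task.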
Additional context: The paper shows deception directions "rotate gradually across layers" within examined models. The geometric analysis (negative correlation R≈-0.435 between geometric similarity and performance differences) applies only within single architectures — cross-architecture geometric analysis is not performed.
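One way to read the within-model geometric analysis is sketched below: compute the cosine similarity between probe directions for every pair of layers, compute the absolute gap in probe performance for the same pair, and correlate the two. This reading is an assumption, since the note does not define "geometric similarity" or "performance differences" precisely. The directions and performance values are toy numbers, not data from the paper.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("direction vectors must be non-zero")
    return dot / (norm_u * norm_v)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def similarity_vs_performance_gap(directions: Dict[int, Sequence[float]],
                                  performance: Dict[int, float]) -> float:
    """Correlate pairwise direction similarity with pairwise performance gaps."""
    sims: List[float] = []
    gaps: List[float] = []
    for a, b in combinations(sorted(directions), 2):
        sims.append(cosine_similarity(directions[a], directions[b]))
        gaps.append(abs(performance[a] - performance[b]))
    return pearson_r(sims, gaps)


# Toy example: a 2-D direction rotating by 30 degrees per layer (made-up).
toy_directions = {
    layer: (math.cos(math.radians(30 * i)), math.sin(math.radians(30 * i)))
    for i, layer in enumerate([8, 16, 24, 32])
}
toy_performance = {8: 0.90, 16: 0.85, 24: 0.75, 32: 0.60}  # made-up scores
r = similarity_vs_performance_gap(toy_directions, toy_performance)
```

In this toy setup, layers with more similar directions have smaller performance gaps, so `r` comes out negative. Running the same computation across architectures would first require a shared representation space, which is one reason the cross-architecture analysis is non-trivial.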
Agent Notes
Why this matters: This is the most important finding for resolving the Beaglehole × SCAV divergence. Nordby et al. is the primary evidence for "Nordby wins" in the divergence (multi-layer probes improve deception detection by 29-78%). The paper's own Limitations section acknowledges that cross-family transfer has NOT been tested, and it also reports strong indirect evidence of family-specificity. This shifts the empirical balance of the divergence:
- If rotation patterns are architecture-specific (as the family-specific probe performance and absence of cross-family testing suggest), then black-box multi-layer SCAV attacks would fail against closed-source models with different architectures
- This supports the "Nordby wins for closed-source deployments" resolution of the divergence
- B4 ("verification degrades faster than capability grows") may need a deployment-model qualifier
What surprised me: The paper providing the best evidence for Nordby (multi-layer probes work) also contains the best indirect evidence that Nordby may win specifically for closed-source deployments. The same architecture-specificity that makes probes non-universal is what might make cross-architecture black-box SCAV attacks fail.
What I expected but didn't find: Direct cross-family transfer test. Adversarial robustness evaluation under SCAV-style attacks. The paper contains no adversarial robustness evaluation whatsoever — all results are on clean data.
KB connections:
- multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent — this is a direct limitation of that claim's generalizability
- multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks — the moderating claim in the divergence
- divergence-representation-monitoring-net-safety — the divergence this evidence partially resolves
- rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility — the specific empirical question this source fails to answer but provides indirect evidence on
Extraction hints:
- Do NOT create a new claim from scratch. Instead, ENRICH multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md with a limitations note about cross-family transfer
- ENRICH multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md: this moderating claim is now better supported; confidence may upgrade from speculative to experimental
- Consider whether the architecture-specificity evidence is strong enough to propose the claim: "Multi-layer ensemble probe performance is architecture-specific, with optimal configurations not generalizing across model families, suggesting deception representation rotation patterns are not universal"
- The divergence's "What Would Resolve This" section still stands — no published direct test of cross-architecture multi-layer SCAV attack transfer exists
Context: Nordby et al. was published April 2026 on arXiv. This is the primary new paper on multi-layer deception probe performance. It has not been peer-reviewed. The family-specific findings and limitations are in the paper itself, not added in later versions.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks — the moderating divergence claim
WHY ARCHIVED: Nordby et al.'s own limitations section provides the most direct available evidence on rotation pattern architecture-specificity. This updates the divergence prior without resolving it.
EXTRACTION HINT: The extractor should focus on ENRICHING the moderating divergence claim, not creating a new claim. The key addition: empirical evidence from Nordby's own paper supports the "rotation patterns are architecture-specific" hypothesis through probe non-generalization. Confidence on the moderating claim upgrades from speculative to experimental. Do not create a standalone claim about rotation patterns — the evidence is indirect.