teleo-codex/inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
2026-04-25 00:14:25 +00:00

---
type: source
title: "Nordby et al. Multi-Layer Probe Limitations: Cross-Family Transfer Not Tested, Family-Specific Patterns Observed"
author: "Nordby, Pais, Parrack (arXiv 2604.13386)"
url: https://arxiv.org/html/2604.13386
date: 2026-04-25
domain: ai-alignment
secondary_domains: []
format: preprint
status: unprocessed
priority: high
tags: [representation-monitoring, linear-probes, multi-layer-ensemble, cross-model-generalization, rotation-patterns, adversarial-robustness, divergence-resolution, b4-verification]
---
## Content
From the Limitations section of Nordby et al. (arXiv 2604.13386), fetched directly from the arXiv HTML:
**Cross-family transfer:** "We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families."
**Family-specific patterns:** "Family-specific patterns we observe (e.g., Llama's strong Insider Trading performance) suggest that optimal approaches may not generalize, limiting practical applicability."
**Empirical architecture-specificity evidence from the paper:**
- Best layer positions vary dramatically across architectures (Figure 3 shows Llama models with high variance versus Qwen's consistent 60-80% range)
- No universal two-layer ensemble improves performance across all tasks simultaneously
- Task-optimal weighting differs substantially across deception types
- Across 12 models and three families, detection scales with model size but varies substantially by deception type
- For larger models, "tens to hundreds of deception related directions" suggest deception is encoded across multiple directional dimensions — the geometry is complex and architecture-dependent
**Additional context:** The paper shows deception directions "rotate gradually across layers" within examined models. The geometric analysis (negative correlation R≈-0.435 between geometric similarity and performance differences) applies only within single architectures — cross-architecture geometric analysis is not performed.
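**Illustrative sketch (not from the paper):** To make "rotate gradually across layers" concrete, the snippet below shows one way such rotation could be measured: fit a direction per layer and take the cosine similarity between consecutive layers. Everything here is an assumption for illustration. The mean-difference probe stands in for whatever probe Nordby et al. actually train, the function names are hypothetical, and the activations are synthetic toy data with a built-in 15-degree rotation per layer, not model activations.

```python
import numpy as np


def mean_difference_direction(acts, labels):
    """Unit vector pointing from the class-0 mean to the class-1 mean.

    A simple stand-in for a trained linear probe direction (assumption).
    """
    direction = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return direction / np.linalg.norm(direction)


def adjacent_layer_cosines(layer_acts, labels):
    """Cosine similarity between the probe directions of consecutive layers."""
    dirs = [mean_difference_direction(a, labels) for a in layer_acts]
    return np.array([float(d0 @ d1) for d0, d1 in zip(dirs[:-1], dirs[1:])])


def make_synthetic_layers(n_layers=6, n_samples=400, dim=16, step_deg=15.0, seed=0):
    """Toy activations whose class-separating direction rotates step_deg per layer.

    Purely synthetic: label 1 ("deceptive") samples are shifted along a
    direction that rotates in the plane of the first two coordinates.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)
    layer_acts = []
    for layer in range(n_layers):
        theta = np.deg2rad(step_deg * layer)
        true_dir = np.zeros(dim)
        true_dir[0], true_dir[1] = np.cos(theta), np.sin(theta)
        noise = rng.normal(scale=0.5, size=(n_samples, dim))
        layer_acts.append(noise + 3.0 * labels[:, None] * true_dir)
    return layer_acts, labels


if __name__ == "__main__":
    layer_acts, labels = make_synthetic_layers()
    # High adjacent-layer cosines with a low first-vs-last cosine is the
    # signature of gradual rotation in this toy setup.
    print("adjacent-layer cosines:", adjacent_layer_cosines(layer_acts, labels))
```

The open question in the divergence is whether such a rotation profile, measured on one architecture, would resemble the profile of another. Answering it would require running the same measurement on activations from different model families, which the paper does not do.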
## Agent Notes
**Why this matters:** This is the most important finding for resolving the Beaglehole × SCAV divergence. Nordby et al. is the primary evidence for the "Nordby wins" side of the divergence (multi-layer probes outperform single-layer probes at deception detection by 29-78%). The paper's own Limitations section acknowledges that cross-family transfer has NOT been tested AND reports strong indirect evidence of family-specificity. This shifts the empirical balance of the divergence:
- If rotation patterns are architecture-specific (as the family-specific probe performance and absence of cross-family testing suggest), then black-box multi-layer SCAV attacks would fail against closed-source models with different architectures
- This supports the "Nordby wins for closed-source deployments" resolution of the divergence
- B4 ("verification degrades faster than capability grows") may need a deployment-model qualifier
**What surprised me:** That the paper providing the best evidence for Nordby (multi-layer probes work) also contains the best indirect evidence that Nordby may win specifically for closed-source deployments — the same architecture-specificity that makes probes non-universal is what might make cross-architecture black-box SCAV attacks fail.
**What I expected but didn't find:** Direct cross-family transfer test. Adversarial robustness evaluation under SCAV-style attacks. The paper contains no adversarial robustness evaluation whatsoever — all results are on clean data.
**KB connections:**
- [[multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent]] — this is a direct limitation of that claim's generalizability
- [[multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks]] — the moderating claim in the divergence
- [[divergence-representation-monitoring-net-safety]] — the divergence this evidence partially resolves
- [[rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility]] — the specific empirical question this source fails to answer but provides indirect evidence on
**Extraction hints:**
- Do NOT create a new claim from scratch — instead, ENRICH `multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md` with a limitations note about cross-family transfer
- ENRICH `multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md` — this moderating claim is now better supported; confidence may upgrade from speculative to experimental
- Consider whether the architecture-specificity evidence is strong enough to propose the claim: "Multi-layer ensemble probe performance is architecture-specific, with optimal configurations not generalizing across model families, suggesting deception representation rotation patterns are not universal"
- The divergence's "What Would Resolve This" section still stands — no published direct test of cross-architecture multi-layer SCAV attack transfer exists
**Context:** Nordby et al. was posted to arXiv in April 2026 and is the primary new paper on multi-layer deception probe performance. It has not been peer-reviewed. The family-specific findings and limitations are in the paper itself, not added in later versions.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks]] — the moderating divergence claim
WHY ARCHIVED: Nordby et al.'s own limitations section provides the most direct available evidence on rotation pattern architecture-specificity. This updates the divergence prior without resolving it.
EXTRACTION HINT: The extractor should focus on ENRICHING the moderating divergence claim, not creating a new claim. The key addition: empirical evidence from Nordby's own paper supports the "rotation patterns are architecture-specific" hypothesis through probe non-generalization. Confidence on the moderating claim upgrades from speculative to experimental. Do not create a standalone claim about rotation patterns — the evidence is indirect.