teleo-codex/inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
2026-04-25 00:14:25 +00:00

type: source
title: "Nordby et al. Multi-Layer Probe Limitations: Cross-Family Transfer Not Tested, Family-Specific Patterns Observed"
author: Nordby, Pais, Parrack (arXiv 2604.13386)
url: https://arxiv.org/html/2604.13386
date: 2026-04-25
domain: ai-alignment
secondary_domains:
format: preprint
status: unprocessed
priority: high
tags:
  - representation-monitoring
  - linear-probes
  - multi-layer-ensemble
  - cross-model-generalization
  - rotation-patterns
  - adversarial-robustness
  - divergence-resolution
  - b4-verification

Content

From the Limitations section of Nordby et al. (arXiv 2604.13386), fetched directly from the arXiv HTML:

Cross-family transfer: "We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families."

Family-specific patterns: "Family-specific patterns we observe (e.g., Llama's strong Insider Trading performance) suggest that optimal approaches may not generalize, limiting practical applicability."

Empirical architecture-specificity evidence from the paper:

  • Best layer positions vary dramatically across architectures (Figure 3 shows Llama models with high variance versus Qwen's consistent 60-80% range)
  • No universal two-layer ensemble improves performance across all tasks simultaneously
  • Task-optimal weighting differs substantially across deception types
  • Across 12 models and three families, detection scales with model size but varies substantially by deception type
  • For larger models, "tens to hundreds of deception related directions" suggest deception is encoded across multiple directional dimensions — the geometry is complex and architecture-dependent
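
To make the "no universal two-layer ensemble" and "task-optimal weighting" points concrete, here is a minimal sketch of how a per-task weight for a two-layer probe ensemble could be selected. Everything in it is an assumption for illustration: the activations are synthetic, the difference-of-means probe is a stand-in (the note does not say how the paper trains its probes), and all function names are hypothetical.

```python
import numpy as np

def fit_mean_difference_probe(acts, labels):
    """Fit a simple linear probe: direction = difference of class means.

    Illustrative stand-in only, not the paper's probe training method.
    """
    mu_pos = acts[labels == 1].mean(axis=0)
    mu_neg = acts[labels == 0].mean(axis=0)
    direction = mu_pos - mu_neg
    # Place the decision boundary at the midpoint between the class means.
    bias = -0.5 * float((mu_pos + mu_neg) @ direction)
    return direction, bias

def probe_scores(acts, probe):
    direction, bias = probe
    return acts @ direction + bias

def accuracy(scores, labels):
    return float(np.mean((scores > 0).astype(int) == labels))

def best_two_layer_weight(scores_a, scores_b, labels, grid=None):
    """Grid-search the weight on layer A; layer B gets (1 - weight)."""
    grid = np.linspace(0.0, 1.0, 11) if grid is None else grid
    accs = [accuracy(w * scores_a + (1.0 - w) * scores_b, labels) for w in grid]
    k = int(np.argmax(accs))
    return float(grid[k]), accs[k]

def make_synthetic_task(rng, signal_a, signal_b, n=400, d=8):
    """Synthetic 'activations' for two layers with different signal strength."""
    labels = np.arange(n) % 2          # alternating labels, both classes present
    sign = 2.0 * labels - 1.0
    layer_a = rng.normal(size=(n, d))
    layer_b = rng.normal(size=(n, d))
    layer_a[:, 0] += signal_a * sign   # the label signal lives in dimension 0
    layer_b[:, 0] += signal_b * sign
    return layer_a, layer_b, labels

rng = np.random.default_rng(0)
# Hypothetical tasks: the informative layer differs between them.
task_signals = {"task_1": (1.5, 0.3), "task_2": (0.3, 1.5)}
results = {}
for name, (signal_a, signal_b) in task_signals.items():
    acts_a, acts_b, y = make_synthetic_task(rng, signal_a, signal_b)
    half = len(y) // 2                 # first half: fit probes; second half: pick weight
    probe_a = fit_mean_difference_probe(acts_a[:half], y[:half])
    probe_b = fit_mean_difference_probe(acts_b[:half], y[:half])
    val_a = probe_scores(acts_a[half:], probe_a)
    val_b = probe_scores(acts_b[half:], probe_b)
    weight_a, ensemble_acc = best_two_layer_weight(val_a, val_b, y[half:])
    results[name] = {
        "weight_a": weight_a,
        "ensemble_acc": ensemble_acc,
        "acc_a": accuracy(val_a, y[half:]),
        "acc_b": accuracy(val_b, y[half:]),
    }
    print(name, results[name])
```

Because the grid includes the weights 0 and 1, the selected ensemble can never score below either single layer on the split used for selection. That also makes the reported number optimistic: a real evaluation would need a third, untouched split. The sketch says nothing about whether a weight chosen for one task or model family carries over to another, which is the open question the Limitations section raises.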

Additional context: The paper shows deception directions "rotate gradually across layers" within examined models. The geometric analysis (negative correlation R≈-0.435 between geometric similarity and performance differences) applies only within single architectures — cross-architecture geometric analysis is not performed.
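
As a rough illustration of what a "geometric similarity versus performance difference" analysis can look like, the sketch below computes the cosine similarity between per-layer probe directions and correlates it with the absolute gap in per-layer scores. It is not the paper's procedure: the directions, the scores, and the function name `pairwise_similarity_vs_gap` are all made up for the example, and the synthetic directions are constructed to rotate gradually across layers.

```python
import numpy as np

def pairwise_similarity_vs_gap(directions, scores):
    """For every layer pair, return (cosine similarity, absolute score gap).

    directions: array of shape (n_layers, d), one probe direction per layer.
    scores:     array of shape (n_layers,), one performance number per layer.
    """
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    cos = unit @ unit.T
    i, j = np.triu_indices(len(scores), k=1)   # each unordered pair once
    return cos[i, j], np.abs(scores[i] - scores[j])

# Synthetic stand-in data (assumption): directions rotate by a quarter turn
# in a 2D plane over 8 layers; scores are placeholder values, not results.
n_layers, d = 8, 16
angles = np.linspace(0.0, np.pi / 2, n_layers)
directions = np.zeros((n_layers, d))
directions[:, 0] = np.cos(angles)
directions[:, 1] = np.sin(angles)
scores = np.linspace(0.60, 0.80, n_layers)

sims, gaps = pairwise_similarity_vs_gap(directions, scores)
r = np.corrcoef(sims, gaps)[0, 1]              # Pearson correlation
print(f"pairs={len(sims)}, pearson_r={r:.3f}")
```

In this toy setup the correlation is negative by construction: layers that are further apart have less similar directions and larger score gaps. Running the same computation across architectures would first require a way to compare directions between models with different hidden sizes, which is part of why the cross-architecture question remains open.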

Agent Notes

Why this matters: This is the most important finding for the Beaglehole × SCAV divergence resolution. Nordby et al. is the primary evidence for "Nordby wins" in the divergence (multi-layer probes improve deception detection 29-78%). The paper's own Limitations section acknowledges that cross-family transfer has NOT been tested AND reports strong indirect evidence of family-specificity. This shifts the empirical balance of the divergence:

  • If rotation patterns are architecture-specific (as the family-specific probe performance and absence of cross-family testing suggest), then black-box multi-layer SCAV attacks would fail against closed-source models with different architectures
  • This supports the "Nordby wins for closed-source deployments" resolution of the divergence
  • B4 ("verification degrades faster than capability grows") may need a deployment-model qualifier

What surprised me: The paper that provides the best evidence for Nordby (multi-layer probes work) also contains the best indirect evidence that Nordby may win specifically for closed-source deployments. The same architecture-specificity that makes probes non-universal is what might make cross-architecture black-box SCAV attacks fail.

What I expected but didn't find: A direct cross-family transfer test, and an adversarial robustness evaluation under SCAV-style attacks. The paper contains no adversarial robustness evaluation at all; every result is on clean data.

KB connections:

Extraction hints:

  • Do NOT create a new claim from scratch — instead, ENRICH multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md with a limitations note about cross-family transfer
  • ENRICH multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md — this moderating claim is now better supported; confidence may upgrade from speculative to experimental
  • Consider whether the architecture-specificity evidence is strong enough to propose the claim: "Multi-layer ensemble probe performance is architecture-specific, with optimal configurations not generalizing across model families, suggesting deception representation rotation patterns are not universal"
  • The divergence's "What Would Resolve This" section still stands — no published direct test of cross-architecture multi-layer SCAV attack transfer exists

Context: Nordby et al. was published April 2026 on arXiv. This is the primary new paper on multi-layer deception probe performance. It has not been peer-reviewed. The family-specific findings and limitations are in the paper itself, not added in later versions.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks — the moderating divergence claim

WHY ARCHIVED: Nordby et al.'s own limitations section provides the most direct available evidence on rotation pattern architecture-specificity. This updates the divergence prior without resolving it.

EXTRACTION HINT: The extractor should focus on ENRICHING the moderating divergence claim, not creating a new claim. The key addition: empirical evidence from Nordby's own paper supports the "rotation patterns are architecture-specific" hypothesis through probe non-generalization. Confidence on the moderating claim upgrades from speculative to experimental. Do not create a standalone claim about rotation patterns — the evidence is indirect.