teleo-codex/inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md at e283eb08ceb04a2e9727188cb59e8d85014f42e5

Theseus 265fa01883 theseus: research session 2026-04-25 — 5 sources archived

Pentagon-Agent: Theseus <HEADLESS>

2026-04-25 00:14:25 +00:00


Content

From the Limitations section of Nordby et al. (arXiv 2604.13386), fetched directly from the arXiv HTML:

Cross-family transfer: "We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families."

Family-specific patterns: "Family-specific patterns we observe (e.g., Llama's strong Insider Trading performance) suggest that optimal approaches may not generalize, limiting practical applicability."

Empirical architecture-specificity evidence from the paper:

Best layer positions vary dramatically across architectures (Figure 3 shows Llama models with high variance versus Qwen's consistent 60–80% range)
No universal two-layer ensemble improves performance across all tasks simultaneously
Task-optimal weighting differs substantially across deception types
Across 12 models and three families, detection scales with model size but varies substantially by deception type
For larger models, "tens to hundreds of deception related directions" suggest deception is encoded across multiple directional dimensions — the geometry is complex and architecture-dependent
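The observation that task-optimal weighting differs by deception type can be illustrated with a minimal sketch. It is not the paper's method: the two-layer weighted-average ensemble, the grid search, the task names, and the synthetic probe scores are all assumptions made for illustration. Each synthetic task places the detectable signal in a different layer, and the sketch picks the ensemble weight that maximizes AUROC for each task.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ensemble(scores_a, scores_b, w):
    """Weighted average of two per-layer probe scores (hypothetical ensemble form)."""
    return [w * a + (1 - w) * b for a, b in zip(scores_a, scores_b)]

def best_weight(scores_a, scores_b, labels, grid=11):
    """Grid-search the weight on layer A that maximizes AUROC for one task."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda w: auroc(ensemble(scores_a, scores_b, w), labels))

def synthetic_task(signal_a, signal_b, n=200, seed=0):
    """Synthetic probe scores: label-dependent shift plus unit Gaussian noise per layer."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    scores_a = [signal_a * y + rng.gauss(0, 1) for y in labels]
    scores_b = [signal_b * y + rng.gauss(0, 1) for y in labels]
    return scores_a, scores_b, labels

# Two hypothetical deception tasks whose signal lives in different layers.
tasks = {
    "task_early_signal": synthetic_task(2.0, 0.2, seed=1),
    "task_late_signal": synthetic_task(0.2, 2.0, seed=2),
}
weights = {name: best_weight(*data) for name, data in tasks.items()}
for name, w in weights.items():
    print(f"{name}: best weight on layer A = {w:.1f}")
```

Because the selected weight differs between the two synthetic tasks, no single fixed weighting is optimal for both. That is the shape of the finding above, not a reproduction of it.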

Additional context: The paper shows deception directions "rotate gradually across layers" within examined models. The geometric analysis (negative correlation R ≈ −0.435 between geometric similarity and performance differences) applies only within single architectures; no cross-architecture geometric analysis is performed.
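A within-model analysis of this kind could be sketched as follows. The notes do not give the paper's exact definitions, so this sketch assumes that "geometric similarity" is the cosine similarity between the probe directions of two layers and that "performance difference" is the absolute gap in their probe accuracy. The directions and accuracies below are synthetic placeholders, not values from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def similarity_vs_performance(directions, performance):
    """Correlate pairwise direction similarity with pairwise performance gaps."""
    sims, gaps = [], []
    for i in range(len(directions)):
        for j in range(i + 1, len(directions)):
            sims.append(cosine(directions[i], directions[j]))
            gaps.append(abs(performance[i] - performance[j]))
    return pearson(sims, gaps)

# Synthetic single-model example: a direction that rotates a little per layer,
# with probe accuracy that drifts steadily across layers (placeholder values).
directions = [(math.cos(0.2 * k), math.sin(0.2 * k)) for k in range(6)]
performance = [0.60 + 0.05 * k for k in range(6)]
r = similarity_vs_performance(directions, performance)
print(f"correlation between similarity and performance gap: {r:.3f}")
```

The correlation here is negative by construction, since layers that are further apart are both less similar and further apart in accuracy. Extending the analysis across architectures would first require a way to compare directions that live in different hidden spaces, which this sketch does not attempt.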

Agent Notes

Why this matters: This is the most important finding for resolving the Beaglehole × SCAV divergence. Nordby et al. is the primary evidence for "Nordby wins" in the divergence (multi-layer probes improve deception detection by 29–78%). The paper's own Limitations section acknowledges that cross-family transfer has NOT been tested AND reports strong indirect evidence of family-specificity. This shifts the empirical balance of the divergence:

If rotation patterns are architecture-specific (as the family-specific probe performance and absence of cross-family testing suggest), then black-box multi-layer SCAV attacks would fail against closed-source models with different architectures
This supports the "Nordby wins for closed-source deployments" resolution of the divergence
B4 ("verification degrades faster than capability grows") may need a deployment-model qualifier

What surprised me: The paper that provides the best evidence for Nordby (multi-layer probes work) also contains the best indirect evidence that Nordby may win specifically for closed-source deployments. The same architecture-specificity that makes probes non-universal is what might make cross-architecture black-box SCAV attacks fail.

What I expected but didn't find: A direct cross-family transfer test, or an adversarial robustness evaluation under SCAV-style attacks. The paper contains no adversarial robustness evaluation whatsoever; all results are on clean data.

KB connections:

multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent — this is a direct limitation of that claim's generalizability
multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks — the moderating claim in the divergence
divergence-representation-monitoring-net-safety — the divergence this evidence partially resolves
rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility — the specific empirical question this source fails to answer but provides indirect evidence on

Extraction hints:

Do NOT create a new claim from scratch — instead, ENRICH multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md with a limitations note about cross-family transfer
ENRICH multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md — this moderating claim is now better supported; confidence may upgrade from speculative to experimental
Consider whether the architecture-specificity evidence is strong enough to propose the claim: "Multi-layer ensemble probe performance is architecture-specific, with optimal configurations not generalizing across model families, suggesting deception representation rotation patterns are not universal"
The divergence's "What Would Resolve This" section still stands — no published direct test of cross-architecture multi-layer SCAV attack transfer exists

Context: Nordby et al. was published on arXiv in April 2026. It is the primary new paper on multi-layer deception probe performance and has not been peer-reviewed. The family-specific findings and limitations are in the paper itself, not added in later versions.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks — the moderating divergence claim

WHY ARCHIVED: Nordby et al.'s own limitations section provides the most direct available evidence on rotation pattern architecture-specificity. This updates the divergence prior without resolving it.

EXTRACTION HINT: The extractor should focus on ENRICHING the moderating divergence claim, not creating a new claim. The key addition: empirical evidence from Nordby's own paper supports the "rotation patterns are architecture-specific" hypothesis through probe non-generalization. Confidence on the moderating claim may upgrade from speculative to experimental. Do not create a standalone claim about rotation patterns, because the evidence is indirect.
