From 2111abe67126bde7460a92de359992e31bf0066f Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sat, 25 Apr 2026 04:17:04 +0000
Subject: [PATCH] theseus: extract claims from 2026-04-25-nordby-cross-model-limitations-family-specific-patterns

- Source: inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus
---
 ...le-probes-outperform-single-layer-by-29-78-percent.md | 7 +++++++
 ...-but-not-white-box-protection-against-scav-attacks.md | 9 ++++++++-
 ...-determines-black-box-multi-layer-scav-feasibility.md | 9 ++++++++-
 ...y-cross-model-limitations-family-specific-patterns.md | 5 ++++-
 4 files changed, 27 insertions(+), 3 deletions(-)
 rename inbox/{queue => archive/ai-alignment}/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md (98%)

diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
index 0b191ea39..bac650bbe 100644
--- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
@@ -87,3 +87,10 @@ Apollo Research's ICML 2025 peer-reviewed paper confirms linear probes distingui
 **Source:** Apollo Research, ICML 2025
 
 Apollo explicitly acknowledges their probe 'sometimes fires for the topic of deception rather than genuine intent' — a surface-feature triggering problem distinct from adversarial robustness. This represents a probe scope limitation where the detector responds to semantic content about deception rather than deceptive intent itself, creating false positives in deployment contexts where deception is discussed honestly.
+
+
+## Challenging Evidence
+
+**Source:** Nordby et al., arXiv 2604.13386, Limitations section
+
+Nordby et al.'s own Limitations section states: 'We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families.' The paper reports family-specific patterns (e.g., Llama's strong Insider Trading performance) and notes that 'optimal approaches may not generalize, limiting practical applicability.' Best layer positions vary dramatically across architectures (Figure 3), no universal two-layer ensemble improves performance across all tasks simultaneously, and task-optimal weighting differs substantially across deception types. This suggests the 29-78% improvement may not transfer across model families.
diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md b/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
index ef7a528b7..d49aafd0b 100644
--- a/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks.md
@@ -10,9 +10,16 @@ agent: theseus
 sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
 scope: structural
 sourcer: Theseus
-related: ["anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent"]
+related: ["anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent", "rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility"]
 ---
 
 # Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal
 
 Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. However, this architectural improvement does not provide structural protection against adversarial attacks in the white-box case. With access to model weights and activations (the standard condition for open-weights models like Llama, Mistral, Falcon), an attacker can generalize SCAV to compute concept directions at each monitored layer and construct a single perturbation suppressing all of them simultaneously. This is a higher-dimensional optimization problem but structurally feasible by the same mechanism as single-layer SCAV. The critical unresolved question is whether black-box attacks transfer: single-layer SCAV transferred to GPT-4 because concept direction universality allowed reconstruction from different models. Multi-layer black-box SCAV requires that rotation patterns (how directions change across layers) are also universal. Beaglehole et al. found concept vectors transfer cross-language and cross-model-family, suggesting the underlying geometry may be universal enough to enable rotation pattern transfer. However, different architectures (depth, attention heads, MLP width, pre-training data) produce different residual stream dynamics, and rotation may depend on model-specific representational basis evolution. No published work tests whether multi-layer rotation patterns transfer across model families. If they do not transfer, multi-layer ensembles provide genuine black-box protection for closed-source models. If they do transfer, multi-layer ensembles merely raise attack cost without escaping the dual-use structure. This creates a deployment-context-dependent safety verdict: open-weights models remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity, while closed-source models may gain genuine robustness if rotation patterns are model-specific.
+
+
+## Supporting Evidence
+
+**Source:** Nordby et al., arXiv 2604.13386, Limitations section and Figure 3
+
+Nordby et al. provide indirect empirical support for architecture-specific rotation patterns. The paper shows: (1) best layer positions vary dramatically across architectures, with Llama models showing high variance versus Qwen's consistent 60-80% range, (2) family-specific probe performance patterns that 'suggest optimal approaches may not generalize', (3) deception encoded across 'tens to hundreds of deception related directions' in larger models, suggesting complex architecture-dependent geometry, and (4) no cross-family transfer testing was performed. The geometric analysis (R≈-0.435 correlation between geometric similarity and performance differences) applies only within single architectures. This supports the hypothesis that rotation patterns are architecture-specific, which would make black-box multi-layer SCAV attacks fail against closed-source models with different architectures.
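The white-box mechanism described in the hunk above (compute a concept direction at each monitored layer, then construct a single perturbation suppressing all of them jointly) can be sketched in a toy linear setting. Everything below, the orthogonal layer maps, the unit probe directions, and the hinge objective, is an illustrative assumption rather than the published SCAV procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 32, 4

# Toy stand-in for a residual stream: each "layer" is a fixed orthogonal map,
# so a concept direction readable at one layer rotates at the next.
layer_maps = [np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(n_layers)]

# Cumulative maps from the input to each layer's activations.
prefix, P = [], np.eye(d)
for M in layer_maps:
    P = M @ P
    prefix.append(P)

x = rng.normal(size=d)

# One linear probe per monitored layer; biases are chosen so every probe
# fires with score exactly 1.0 on the clean input x.
dirs = [w / np.linalg.norm(w) for w in (rng.normal(size=d) for _ in range(n_layers))]
biases = [1.0 - w @ (P @ x) for w, P in zip(dirs, prefix)]

def probe_scores(inp):
    return np.array([w @ (P @ inp) + b for w, P, b in zip(dirs, prefix, biases)])

def joint_perturbation(inp, steps=2000, lr=0.02):
    """Gradient descent on the sum of hinge losses max(0, score_l):
    ONE input-space perturbation suppressing all monitored layers at once."""
    delta = np.zeros(d)
    for _ in range(steps):
        scores = probe_scores(inp + delta)
        grad = sum(P.T @ w for s, w, P in zip(scores, dirs, prefix) if s > 0)
        if np.isscalar(grad):  # sum over empty set: every probe already silent
            break
        delta -= lr * grad
    return delta

delta = joint_perturbation(x)
# probe_scores(x + delta) should now be suppressed relative to probe_scores(x).
```

Because the attacker here holds the weights (the `layer_maps`), the extra monitored layers only enlarge the optimization problem; nothing structural blocks the joint suppression, which is the white-box half of the claim.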
diff --git a/domains/ai-alignment/rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility.md b/domains/ai-alignment/rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility.md
index 1eb593b2a..1936772b9 100644
--- a/domains/ai-alignment/rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility.md
+++ b/domains/ai-alignment/rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility.md
@@ -11,9 +11,16 @@ sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-s
 scope: structural
 sourcer: Theseus
 supports: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks"]
-related: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks"]
+related: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent"]
 ---
 
 # Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness
 
 The feasibility of black-box multi-layer SCAV attacks depends on whether the rotation pattern of concept directions across layers is universal across model families or model-specific. Single-layer SCAV achieved black-box transfer to GPT-4 because concept direction universality (confirmed by Beaglehole et al. for cross-language and cross-model-family transfer) allowed attackers to reconstruct the target model's concept direction from a different model. For multi-layer SCAV, the attacker must reconstruct not just the concept direction at one layer, but the entire rotation pattern across all monitored layers. Two competing arguments exist: (1) Rotation universality: If the underlying geometry of safety representations is universal enough to enable cross-language transfer (Beaglehole et al.), the rotation pattern may also be universal, making black-box multi-layer SCAV feasible. (2) Rotation specificity: Different model architectures (transformer depth, attention head count, MLP width, pre-training data) produce different residual stream dynamics. The concept direction at any single layer is a projection of a universal concept onto a model-specific representational basis, and the rotation across layers depends on how that basis evolves, which may not be universal. This is a testable empirical question with no published results. If rotation patterns are model-specific, multi-layer ensemble monitoring provides genuine black-box adversarial robustness for closed-source models, creating a structural safety advantage over open-weights deployment. If rotation patterns are universal, multi-layer ensembles provide no black-box protection, and the dual-use vulnerability holds across all deployment contexts.
+
+
+## Extending Evidence
+
+**Source:** Nordby et al., arXiv 2604.13386
+
+Nordby et al. provide the strongest available indirect evidence on rotation pattern architecture-specificity. While the paper does not directly test cross-architecture rotation pattern transfer (the key empirical question), it reports: (1) dramatic variation in optimal layer positions across model families, (2) family-specific probe performance patterns with explicit acknowledgment that 'optimal approaches may not generalize', (3) that no universal two-layer ensemble configuration works across all tasks, and (4) deception geometry complexity that increases with model size ('tens to hundreds of deception related directions'). The absence of cross-family transfer testing combined with observed family-specific patterns suggests rotation patterns are not universal, though direct confirmation remains absent from published literature.
diff --git a/inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md b/inbox/archive/ai-alignment/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
similarity index 98%
rename from inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
rename to inbox/archive/ai-alignment/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
index f41de8cf6..ea63a1a4e 100644
--- a/inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
+++ b/inbox/archive/ai-alignment/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
@@ -7,9 +7,12 @@ date: 2026-04-25
 domain: ai-alignment
 secondary_domains: []
 format: preprint
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-04-25
 priority: high
 tags: [representation-monitoring, linear-probes, multi-layer-ensemble, cross-model-generalization, rotation-patterns, adversarial-robustness, divergence-resolution, b4-verification]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
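The untested question flagged in the Extending Evidence hunk (do per-layer concept directions learned on one model family say anything about another's?) reduces to a concrete measurement. A minimal sketch follows, assuming model-specific random orthogonal layer maps as a stand-in for different architectures; nothing here reproduces Nordby et al.'s actual setup, and all names are illustrative:

```python
import numpy as np

d, n_layers = 32, 6
rng = np.random.default_rng(0)

# A shared ("universal") concept direction at the input embedding.
base = rng.normal(size=d)
base /= np.linalg.norm(base)

def concept_directions(seed):
    """Per-layer concept directions for a toy 'model': the shared base
    direction pushed through this model's own random orthogonal layer maps."""
    r = np.random.default_rng(seed)
    w, out = base.copy(), []
    for _ in range(n_layers):
        Q = np.linalg.qr(r.normal(size=(d, d)))[0]  # model-specific layer map
        w = Q @ w
        out.append(w / np.linalg.norm(w))
    return out

model_a = concept_directions(seed=10)
model_b = concept_directions(seed=20)

# Layer-by-layer transfer: cosine between model A's direction and model B's.
# Near-zero values mean an attacker using A's directions misses B's probes,
# i.e. model-specific rotation yields black-box protection in this toy world.
layer_transfer = np.array([a @ b for a, b in zip(model_a, model_b)])
```

In this construction the rotation is model-specific by fiat, so `layer_transfer` stays near zero; running the same measurement on real model pairs is exactly the cross-family experiment the claim identifies as missing.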