theseus: extract claims from 2026-04-25-nordby-cross-model-limitations-family-specific-patterns
Some checks failed
Mirror PR to Forgejo / mirror (pull_request) Has been cancelled
- Source: inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
This commit is contained in:
parent
265fa01883
commit
e4f1398db6
4 changed files with 27 additions and 3 deletions
@@ -73,3 +73,10 @@ The 29-78% accuracy improvement applies to clean-data monitoring but does not tr
**Source:** Theseus synthetic analysis
The 29-78% AUROC improvement applies to clean-data monitoring accuracy but does not translate to adversarial robustness. Open-weights models remain fully vulnerable to white-box multi-layer SCAV attacks regardless of ensemble complexity. Black-box robustness depends on untested rotation pattern universality.
## Challenging Evidence
**Source:** Nordby et al. 2026, arXiv 2604.13386, Limitations section
Nordby et al.'s own Limitations section states: 'We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families.' The paper reports family-specific patterns (e.g., Llama's strong Insider Trading performance) and notes that 'optimal approaches may not generalize, limiting practical applicability.' Best layer positions vary dramatically across architectures (Figure 3), no universal two-layer ensemble improves performance across all tasks simultaneously, and task-optimal weighting differs substantially across deception types. This suggests the 29-78% improvement may not generalize beyond the tested model families.
@@ -10,9 +10,16 @@ agent: theseus
sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
scope: structural
sourcer: Theseus
-related: ["anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent"]
+related: ["anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "trajectory-monitoring-dual-edge-geometric-concentration", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent", "multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility"]
---
# Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal
Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. However, this architectural improvement does not provide structural protection against adversarial attacks in the white-box case. With access to model weights and activations (the standard condition for open-weights models such as Llama, Mistral, and Falcon), an attacker can generalize SCAV to compute concept directions at each monitored layer and construct a single perturbation that suppresses all of them simultaneously. This is a higher-dimensional optimization problem, but it is structurally feasible by the same mechanism as single-layer SCAV.

The critical unresolved question is whether black-box attacks transfer: single-layer SCAV transferred to GPT-4 because concept-direction universality allowed reconstruction from different models. Multi-layer black-box SCAV additionally requires that rotation patterns (how directions change across layers) are universal. Beaglehole et al. found that concept vectors transfer cross-language and cross-model-family, suggesting the underlying geometry may be universal enough to enable rotation-pattern transfer. However, different architectures (depth, attention heads, MLP width, pre-training data) produce different residual stream dynamics, and rotation may depend on model-specific evolution of the representational basis. No published work tests whether multi-layer rotation patterns transfer across model families. If they do not transfer, multi-layer ensembles provide genuine black-box protection for closed-source models. If they do transfer, multi-layer ensembles merely raise attack cost without escaping the dual-use structure.

This creates a deployment-context-dependent safety verdict: open-weights models remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity, while closed-source models may gain genuine robustness if rotation patterns are model-specific.
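The white-box mechanism described above can be made concrete with a toy linear model. This is a minimal sketch, not code from Nordby et al. or the SCAV literature: the layer maps, concept directions, and dimensions are synthetic stand-ins. The point is only structural: when the attacker knows every monitored layer's concept direction, suppressing all of them collapses into one joint (higher-dimensional) optimization, here even solvable in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 4

# Hypothetical stand-ins for the target model: per-layer linear maps of the
# residual stream and per-layer deception concept directions (unit vectors).
# A real white-box attacker would read these off the open-weights model and
# its probes; everything here is synthetic.
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
dirs = rng.standard_normal((n_layers, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

h0 = rng.standard_normal(d)  # clean activation entering the monitored span

# In this linear toy, the layer-l probe score is c_l . (h0 + delta), where
# c_l pulls that layer's concept direction back through the intervening maps.
M = np.eye(d)
C = []
for l in range(n_layers):
    C.append(M.T @ dirs[l])
    M = W[l] @ M
C = np.stack(C)  # (n_layers, d): one pulled-back direction per monitored layer

# Multi-layer SCAV as one joint problem: the minimum-norm perturbation that
# zeroes every monitored projection simultaneously.
delta = -np.linalg.pinv(C) @ (C @ h0)

before = np.abs(C @ h0).max()          # clean probe scores: order 1
after = np.abs(C @ (h0 + delta)).max() # all monitored scores driven to ~0
```

Monitoring more layers only grows the row count of `C`; as long as the attacker sees the weights, the same least-squares mechanism applies, which is why ensemble complexity alone does not change the white-box verdict.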
## Extending Evidence
**Source:** Nordby et al. 2026, arXiv 2604.13386, Limitations + empirical results
Nordby et al. provides indirect empirical evidence for architecture-specificity of rotation patterns through probe non-generalization. The paper shows: (1) best layer positions vary dramatically across architectures, (2) family-specific patterns in probe performance (Llama vs Qwen), (3) no universal ensemble configuration works across tasks, (4) 'tens to hundreds of deception related directions' suggesting complex architecture-dependent geometry, and (5) rotation patterns analyzed only within single architectures, not cross-architecture. The authors explicitly acknowledge cross-family transfer was not tested. This supports the hypothesis that black-box multi-layer SCAV attacks would fail against closed-source models with different architectures, strengthening the 'Nordby wins for closed-source deployments' resolution of the divergence. Confidence upgrades from speculative to experimental.
@@ -11,9 +11,16 @@ sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-s
scope: structural
sourcer: Theseus
supports: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks"]
-related: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks"]
+related: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility"]
---
# Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness
The feasibility of black-box multi-layer SCAV attacks depends on whether the rotation pattern of concept directions across layers is universal across model families or model-specific. Single-layer SCAV achieved black-box transfer to GPT-4 because concept-direction universality (confirmed by Beaglehole et al. for cross-language and cross-model-family transfer) allowed attackers to reconstruct the target model's concept direction from a different model. For multi-layer SCAV, the attacker must reconstruct not just the concept direction at one layer but the entire rotation pattern across all monitored layers. Two competing arguments exist:

1. Rotation universality: if the underlying geometry of safety representations is universal enough to enable cross-language transfer (Beaglehole et al.), the rotation pattern may also be universal, making black-box multi-layer SCAV feasible.
2. Rotation specificity: different model architectures (transformer depth, attention head count, MLP width, pre-training data) produce different residual stream dynamics. The concept direction at any single layer is a projection of a universal concept onto a model-specific representational basis, and the rotation across layers depends on how that basis evolves, which may not be universal.

This is a testable empirical question with no published results. If rotation patterns are model-specific, multi-layer ensemble monitoring provides genuine black-box adversarial robustness for closed-source models, creating a structural safety advantage over open-weights deployment. If rotation patterns are universal, multi-layer ensembles provide no black-box protection, and the dual-use vulnerability holds across all deployment contexts.
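The universality question can be phrased as a simple measurement. The sketch below is entirely hypothetical (no real model activations are involved): each "family" gets a synthetic per-layer concept-direction trajectory, and we check whether one family's directions stay aligned with another's layer by layer, which is exactly what a black-box multi-layer attacker would need to reuse across families.

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 32, 8  # toy activation dimension and number of monitored layers

def concept_trajectory(angle_per_layer, noise=0.02):
    """Synthetic per-layer concept directions: a planar rotation with a
    family-specific schedule plus small noise. A stand-in for probe
    directions fit independently at each layer of a real model."""
    traj = []
    for l in range(L):
        theta = angle_per_layer * l
        u = np.zeros(d)
        u[0], u[1] = np.cos(theta), np.sin(theta)
        u += noise * rng.standard_normal(d)
        traj.append(u / np.linalg.norm(u))
    return np.stack(traj)

def cross_family_alignment(traj_a, traj_b):
    """Per-layer cosine between two families' concept directions. High
    values mean family A's directions (and hence its rotation pattern)
    could be reused against family B without access to B's internals."""
    return np.sum(traj_a * traj_b, axis=1)

fam_a = concept_trajectory(0.2)  # two families sharing a rotation schedule:
fam_b = concept_trajectory(0.2)  #   the "universal" scenario
fam_c = concept_trajectory(0.6)  # different schedule: "model-specific"

universal = cross_family_alignment(fam_a, fam_b).mean()
specific = cross_family_alignment(fam_a, fam_c).mean()
# Under a shared schedule, alignment stays near 1 across layers; under a
# family-specific schedule it decays with depth, so directions transferred
# from family A progressively miss family C's monitored subspace.
```

The same alignment statistic, computed on probe directions fit to two real model families, is the missing cross-architecture experiment the text identifies.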
## Extending Evidence
**Source:** Nordby et al. 2026, arXiv 2604.13386
Nordby et al. provides the strongest available indirect evidence on rotation pattern architecture-specificity. While the paper does not directly test cross-architecture rotation pattern transfer, it shows: (1) probe performance is family-specific with non-generalizing optimal configurations, (2) layer position optima vary dramatically across families (Llama high variance vs Qwen consistent 60-80%), (3) geometric analysis (R≈-0.435 correlation) performed only within single architectures, and (4) 'tens to hundreds of deception related directions' indicating complex geometry that varies by architecture. The absence of cross-family testing combined with strong within-family variation provides experimental-grade evidence that rotation patterns are architecture-specific, though direct cross-architecture geometric analysis remains absent.
@@ -7,9 +7,12 @@ date: 2026-04-25
domain: ai-alignment
secondary_domains: []
format: preprint
-status: unprocessed
+status: processed
processed_by: theseus
processed_date: 2026-04-25
priority: high
tags: [representation-monitoring, linear-probes, multi-layer-ensemble, cross-model-generalization, rotation-patterns, adversarial-robustness, divergence-resolution, b4-verification]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content