theseus: extract claims from 2026-04-25-nordby-cross-model-limitations-family-specific-patterns

- Source: inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-25 00:16:07 +00:00

type: claim
domain: ai-alignment
description: If deception direction rotation patterns across layers are model-specific rather than universal, closed-source models gain genuine protection that open-weights models cannot achieve
confidence: speculative
source: Theseus synthetic analysis identifying untested empirical question
created: 2026-04-22
title: Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness
agent: theseus
sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
scope: structural
sourcer: Theseus
supports:
- multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks
- representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface
- anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks
- rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility

Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness

The feasibility of black-box multi-layer SCAV attacks depends on whether the rotation pattern of concept directions across layers is universal across model families or model-specific. Single-layer SCAV achieved black-box transfer to GPT-4 because concept direction universality (confirmed by Beaglehole et al. for cross-language and cross-model-family transfer) allowed attackers to reconstruct the target model's concept direction from a different model. For multi-layer SCAV, the attacker must reconstruct not just the concept direction at one layer but the entire rotation pattern across all monitored layers.

Two competing arguments exist:

(1) Rotation universality: If the underlying geometry of safety representations is universal enough to enable cross-language transfer (Beaglehole et al.), the rotation pattern may also be universal, making black-box multi-layer SCAV feasible.

(2) Rotation specificity: Differences in architecture (transformer depth, attention head count, MLP width) and in pre-training data produce different residual stream dynamics. The concept direction at any single layer is a projection of a universal concept onto a model-specific representational basis, and the rotation across layers depends on how that basis evolves from layer to layer, which may not be universal.

This is a testable empirical question with no published results. If rotation patterns are model-specific, multi-layer ensemble monitoring provides genuine black-box adversarial robustness for closed-source models, creating a structural safety advantage over open-weights deployment. If rotation patterns are universal, multi-layer ensembles provide no black-box protection, and the dual-use vulnerability holds across all deployment contexts.
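One way to make "rotation pattern" concrete is as the sequence of angles between the concept direction at each monitored layer and the direction at the next. The sketch below is illustrative only: the function names `cosine` and `rotation_profile` are hypothetical, the input vectors are toy values rather than real probe weights, and nothing here is taken from the SCAV or Beaglehole et al. implementations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rotation_profile(directions):
    """Angles (degrees) between concept directions at consecutive monitored layers.

    `directions` is a list of per-layer direction vectors from ONE model.
    The result has one entry per adjacent layer pair.
    """
    angles = []
    for prev, nxt in zip(directions, directions[1:]):
        c = max(-1.0, min(1.0, cosine(prev, nxt)))  # clamp against float rounding
        angles.append(math.degrees(math.acos(c)))
    return angles

# Toy 2-D directions at three monitored layers (assumed values for illustration).
toy_directions = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
profile = rotation_profile(toy_directions)  # roughly 45 degrees per step
```

Because the profile consists of angles rather than raw vectors, it does not depend on hidden dimension or on the choice of basis, which is what would let profiles from different model families be compared at all. Whether such profiles actually match across families is the open question stated above.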

Extending Evidence

Source: Nordby et al. 2026, arXiv 2604.13386

Nordby et al. provide the strongest available indirect evidence that rotation patterns are architecture-specific. The paper does not directly test cross-architecture transfer of rotation patterns, but it reports that: (1) probe performance is family-specific, with optimal configurations that do not generalize; (2) layer position optima vary dramatically across families (Llama high variance vs Qwen consistent at 60-80%); (3) geometric analysis (R≈-0.435 correlation) was performed only within single architectures; and (4) there are 'tens to hundreds of deception related directions', indicating complex geometry that varies by architecture. Strong family-specific variation, with no cross-family test to contradict it, provides experimental-grade but indirect evidence that rotation patterns are architecture-specific; direct cross-architecture geometric analysis remains absent.
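The missing cross-architecture comparison could, under stated assumptions, be sketched as follows: take a per-layer rotation profile from each of two models, resample both onto a common relative-depth grid (since the models may differ in depth), and correlate them. The helper names `resample`, `pearson`, and `profile_similarity` are hypothetical, the profiles are synthetic, and this is not the analysis performed by Nordby et al.

```python
import math

def resample(profile, n_points):
    """Linearly interpolate a per-layer profile onto n_points relative depths (n_points >= 2)."""
    if len(profile) == 1:
        return [profile[0]] * n_points
    out = []
    for i in range(n_points):
        pos = i * (len(profile) - 1) / (n_points - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(profile) - 1)
        frac = pos - lo
        out.append(profile[lo] * (1 - frac) + profile[hi] * frac)
    return out

def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def profile_similarity(profile_a, profile_b, n_points=16):
    """Correlation of two rotation profiles after aligning them by relative depth."""
    return pearson(resample(profile_a, n_points), resample(profile_b, n_points))

# Synthetic profiles (degrees) for two hypothetical models of different depth.
model_a = [10.0, 20.0, 30.0, 40.0]
model_b = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0]
model_c = [40.0, 30.0, 20.0, 10.0]

same_shape = profile_similarity(model_a, model_b)      # close to +1
opposite_shape = profile_similarity(model_a, model_c)  # close to -1
```

A high correlation across families would favor the rotation-universality argument, and a low or unstable one would favor rotation specificity. Aligning by relative depth is itself an assumption; a real study would need to justify that alignment or test alternatives.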
