theseus: extract claims from 2026-04-25-nordby-cross-model-limitations-family-specific-patterns

- Source: inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
Teleo Agents 2026-04-30 02:23:10 +00:00
parent 984dd64a94
commit 6ffb44cdaf
4 changed files with 25 additions and 1 deletion


@@ -94,3 +94,10 @@ Apollo explicitly acknowledges their probe 'sometimes fires for the topic of dec
**Source:** Theseus Session 37 synthesis of Nordby et al. and SCAV evidence
Multi-layer ensemble probes represent a conditional exception to verification degradation for closed-source models. The Nordby × SCAV synthesis shows: (1) For open-weights models: no protection against white-box multi-layer SCAV attacks, so B4 holds. (2) For closed-source models facing black-box adversaries: genuine structural protection IF rotation patterns are architecture-specific. Nordby's limitations section provides indirect evidence of architecture-specificity (family-specific probe performance, no universal two-layer ensemble). This is the first verification approach that might SCALE with model size for closed-source deployments (~5% AUROC gain per 10x parameter increase, R=0.81). If rotation patterns are architecture-specific, larger closed-source models become EASIER to monitor through representation approaches, partially inverting the B4 claim for this domain. However, this exception does not reach the alignment-relevant core: verifying values, intent, and long-term consequences.
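For readers unfamiliar with the mechanism at issue, a multi-layer ensemble probe can be sketched in a few lines: train one linear probe per layer and average their scores. Everything below is synthetic and illustrative (invented layer count, signal strengths, and data); it is not Nordby et al.'s setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, d = 600, 32  # examples and (toy) activation dimension
labels = rng.integers(0, 2, n)

# Synthetic per-layer "activations": each layer carries the deception label
# along its own random direction, at a different strength.
acts = []
for strength in (0.6, 0.9, 1.2, 1.5):
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    acts.append(rng.standard_normal((n, d)) + np.outer(labels * strength, direction))

train, test = slice(0, 400), slice(400, None)

# One linear probe per layer; the ensemble averages their scores.
scores = []
for a in acts:
    probe = LogisticRegression(max_iter=1000).fit(a[train], labels[train])
    scores.append(probe.predict_proba(a[test])[:, 1])

layer_aucs = [roc_auc_score(labels[test], s) for s in scores]
ens_auc = roc_auc_score(labels[test], np.mean(scores, axis=0))
print("per-layer AUROC:", [round(a, 3) for a in layer_aucs])
print("ensemble AUROC:", round(ens_auc, 3))
```

In this toy setting the averaged ensemble pools independent per-layer signals, which is the intuition behind the within-family improvements the synthesis cites; whether that carries across model families is exactly the open question.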
## Challenging Evidence
**Source:** Nordby et al. arXiv 2604.13386, Limitations section
Nordby et al.'s own Limitations section states: 'We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families.' The paper reports family-specific patterns (e.g., Llama's strong Insider Trading performance) and notes that 'optimal approaches may not generalize, limiting practical applicability.' Best layer positions vary dramatically across architectures (Figure 3 shows Llama models with high variance versus Qwen's consistent 60-80% range). No universal two-layer ensemble improves performance across all tasks simultaneously. This directly challenges the generalizability of the 29-78% improvement claim beyond within-family scaling.


@@ -30,3 +30,10 @@ The moderating claim that multi-layer ensemble probes provide black-box robustne
**Source:** Schnoor et al. 2025, arXiv 2509.22755
CAV-based monitoring techniques exhibit fundamental sensitivity to non-concept distribution choice (Schnoor et al., arXiv 2509.22755). The authors demonstrate that CAVs are random vectors whose distribution depends heavily on the arbitrary choice of non-concept examples used during training. They present an adversarial attack on TCAV (Testing with CAVs) that exploits this distributional dependence. This suggests cross-architecture concept direction transfer faces distributional incompatibility beyond architectural differences alone—even within a single model, CAV reliability depends on training distribution choices that would necessarily differ across model families.
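Schnoor et al.'s fragility result can be illustrated with a minimal difference-of-means CAV on synthetic activations. All data here is invented, and real TCAV fits a linear classifier rather than subtracting centroids, but the distributional dependence is the same: two equally defensible non-concept sets yield nearly unrelated concept vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical activation dimension

# A fixed "concept" direction; concept examples cluster around it.
concept_dir = rng.standard_normal(d)
concept_dir /= np.linalg.norm(concept_dir)
concept = 0.5 * rng.standard_normal((200, d)) + concept_dir

def cav(non_concept):
    # Difference-of-means CAV: concept centroid minus non-concept centroid.
    v = concept.mean(axis=0) - non_concept.mean(axis=0)
    return v / np.linalg.norm(v)

# Two arbitrary non-concept choices: isotropic noise vs a shifted cluster.
nc_a = rng.standard_normal((200, d))
nc_b = rng.standard_normal((200, d)) + rng.standard_normal(d)

cos = float(cav(nc_a) @ cav(nc_b))
print(f"cosine similarity between the two CAVs: {cos:.3f}")
```

The same concept examples produce nearly orthogonal CAVs under the two choices, which is the instability the attack on TCAV exploits.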
## Extending Evidence
**Source:** Nordby et al. arXiv 2604.13386, Limitations + empirical results
Nordby et al. provide indirect empirical evidence for architecture-specificity of rotation patterns through probe non-generalization. Family-specific probe performance patterns, dramatic variance in optimal layer positions across architectures, and the absence of universal ensemble configurations suggest that rotation patterns are architecture-dependent. The paper notes 'tens to hundreds of deception related directions' in larger models, indicating complex, architecture-specific geometry. This supports the hypothesis that black-box multi-layer SCAV attacks would fail against closed-source models with different architectures, strengthening the 'Nordby wins for closed-source deployments' resolution. However, the paper contains no adversarial robustness evaluation whatsoever; all results are on clean data. Confidence upgrades from speculative to experimental based on indirect evidence.


@@ -24,3 +24,10 @@ The feasibility of black-box multi-layer SCAV attacks depends on whether the rot
**Source:** Schnoor et al. 2025, arXiv 2509.22755
Theoretical analysis from XAI literature shows CAVs (Concept Activation Vectors) are fundamentally fragile to non-concept distribution choice (Schnoor et al., arXiv 2509.22755). Since non-concept distributions necessarily differ across model architectures and training regimes, this provides theoretical grounding for why rotation patterns extracted via SCAV would fail to transfer across model families—the concept vectors themselves are unstable under distributional shifts inherent to cross-architecture application.
## Extending Evidence
**Source:** Nordby et al. arXiv 2604.13386
Nordby et al. provide the strongest available indirect evidence on rotation pattern architecture-specificity, though they do not directly test cross-architecture transfer. The paper shows: (1) family-specific probe performance patterns that do not generalize, (2) dramatic variance in optimal layer positions across model families (Llama high variance vs Qwen consistent 60-80%), (3) no universal two-layer ensemble that improves all tasks, (4) task-optimal weighting differs substantially across deception types and families. The geometric analysis (R≈-0.435 correlation between geometric similarity and performance) applies only within single architectures; cross-architecture geometric analysis was not performed. This suggests rotation patterns are architecture-specific, but the question remains empirically unresolved for black-box SCAV attacks.
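The architecture-specificity hypothesis can be made concrete with a synthetic proxy: if per-layer probe directions within one model family are small rotations of a shared family direction, while a second family's directions are drawn independently, then within-family cosine similarity is high and cross-family similarity is near zero. Everything below (dimensions, noise scale, family structure) is an invented stand-in, not measured model geometry.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 48  # hypothetical activation dimension

def family_directions(n_layers, rng):
    # One model family: per-layer probe directions are small perturbations
    # of a shared family-specific "deception" direction.
    base = rng.standard_normal(d)
    base /= np.linalg.norm(base)
    dirs = []
    for _ in range(n_layers):
        v = base + 0.1 * rng.standard_normal(d)
        dirs.append(v / np.linalg.norm(v))
    return dirs

fam_a = family_directions(4, rng)  # one synthetic family
fam_b = family_directions(4, rng)  # an independently drawn second family

within = float(np.mean([fam_a[i] @ fam_a[j]
                        for i in range(4) for j in range(i + 1, 4)]))
across = float(np.mean([a @ b for a in fam_a for b in fam_b]))
print(f"mean within-family cosine: {within:.2f}")
print(f"mean cross-family cosine:  {across:.2f}")
```

A black-box SCAV attacker who can only estimate directions from a different family would, under this assumption, be working with vectors nearly orthogonal to the target model's; a cross-architecture version of this measurement on real activations is what the open question calls for.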


@@ -7,9 +7,12 @@ date: 2026-04-25
domain: ai-alignment
secondary_domains: []
format: preprint
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-04-30
priority: high
tags: [representation-monitoring, linear-probes, multi-layer-ensemble, cross-model-generalization, rotation-patterns, adversarial-robustness, divergence-resolution, b4-verification]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content