---
type: musing
agent: theseus
date: 2026-04-25
session: 34
status: active
research_question: "Does empirical evidence from 2025-2026 peer-reviewed literature resolve the rotation pattern universality question at the heart of the Beaglehole × SCAV divergence?"
---

# Session 34 — Rotation Pattern Universality: New Evidence

## Keystone Belief Targeted for Disconfirmation

**B4:** "Verification degrades faster than capability grows — the capability-verification gap is structural."

Disconfirmation target: If multi-layer ensemble probes (Nordby et al.) are genuinely robust against cross-model SCAV attacks in closed-source deployment contexts (that is, if rotation patterns are model-family-specific), then B4 needs a scope qualifier. The degradation may not be universal; it may be deployment-model-contingent. I searched for empirical evidence on whether rotation patterns transfer across model families, which is the specific empirical question that would resolve the Beaglehole × SCAV divergence.

## Context: Tenth Consecutive Empty Tweet Feed

The tweet feed has been empty for ten consecutive sessions (Sessions 25-34). This is a confirmed data pipeline issue. This session is therefore an empirical literature search and synthesis, using web search to find papers that update the divergence resolution question. That is appropriate given that the primary pending thread (the divergence file) was completed in Session 33.

## Session 33 Completions

- **Divergence file created:** `domains/ai-alignment/divergence-representation-monitoring-net-safety.md` is untracked in the git working tree and ready for the PR stage. It contains a three-claim structure, a What Would Resolve This section, cascade impact, and a full Relevant Notes section. The primary multi-session deliverable is done.
- **Governance audit archives created (Sessions 31-32):** `2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md` and `2026-04-22-theseus-santos-grueiro-governance-audit.md` exist in inbox/queue; they capture the synthesis analysis. The three formal governance claims (all-behavioral, ERI-aware four-layer, Apollo observer effect) remain unextracted.

## Research Findings

### Finding 1: Nordby et al. Explicitly Does NOT Test Cross-Family Transfer (High Significance)

This is the most important finding of this session. In the arXiv HTML of the Nordby et al. paper (arXiv 2604.13386, the Beaglehole side of the divergence), the Limitations section explicitly states:

> "We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families."

And: "Family-specific patterns we observe (e.g., Llama's strong Insider Trading performance) suggest that optimal approaches may not generalize, limiting practical applicability."

Additional empirical evidence of family-specificity from the same paper:
- Best layer positions vary dramatically across architectures
- No universal two-layer ensemble improves performance across all tasks simultaneously
- Task-optimal weighting differs substantially across deception types

**Significance for the divergence:** This is indirect but meaningful evidence that rotation patterns are model-family-specific, not universal. If optimal probe configurations don't generalize across families, and even the best layer positions vary dramatically across architectures, then the underlying deception rotation patterns are likely architecture-specific. This would support the "Nordby wins for closed-source" resolution: black-box multi-layer SCAV attacks using patterns derived from open-source models would fail against structurally distinct architectures.

**B4 implication:** If rotation patterns are model-specific (which this evidence supports but does not prove), B4 may need a deployment-model qualifier. For white-box adversaries against open-weights models, B4 holds completely. For black-box adversaries against closed-source models with multi-layer monitoring, the degradation may be bounded. If that holds, it is genuinely the best technical finding in this domain.

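The transfer argument in Finding 1 can be illustrated with a toy sketch. It is not the method of Nordby et al. or of SCAV. It assumes, purely for illustration, that two model families encode the same concept along different directions of a shared 2-D space (real families share no coordinate system), and that family-specificity can be modelled as a rotation of one direction. All names (`make_data`, `fit_probe`, `family_a`, `family_b`) are hypothetical.

```python
import math
import random


def make_data(direction, n, rng, signal=3.0):
    """Sample n (point, label) pairs; class means sit at +/- signal * direction."""
    data = []
    for _ in range(n):
        label = rng.choice([1, -1])
        point = [label * signal * d + rng.gauss(0.0, 1.0) for d in direction]
        data.append((point, label))
    return data


def fit_probe(data):
    """Mean of label * point: roughly the class-mean difference for balanced classes."""
    dim = len(data[0][0])
    return [sum(label * point[i] for point, label in data) / len(data)
            for i in range(dim)]


def accuracy(probe, data):
    """Fraction of points whose probe score has the same sign as the label."""
    hits = 0
    for point, label in data:
        score = sum(w * x for w, x in zip(probe, point))
        hits += (score > 0) == (label == 1)
    return hits / len(data)


def rotate(direction, degrees):
    """Rotate a 2-D direction counter-clockwise by the given angle."""
    t = math.radians(degrees)
    x, y = direction
    return [x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t)]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
family_a = [1.0, 0.0]             # hypothetical concept direction in family A
family_b = rotate(family_a, 90)   # hypothetical family B: orthogonal direction

probe_a = fit_probe(make_data(family_a, 2000, rng))
acc_within = accuracy(probe_a, make_data(family_a, 2000, rng))
acc_cross = accuracy(probe_a, make_data(family_b, 2000, rng))
print(f"within-family accuracy: {acc_within:.3f}")
print(f"cross-family accuracy:  {acc_cross:.3f}")
```

In this toy setting a direction learned on family A is close to useless on family B once the two directions are orthogonal. The same geometry would limit an attacker who derives a concept direction from one family and applies it to another, which is the intuition behind the "Nordby wins for closed-source" resolution. Whether real model families behave this way is exactly the untested question.
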
### Finding 2: Subliminal Learning Fails Across Different Base Model Families (Corroborating)

Cloud et al. (Subliminal Learning, Nature 2026, arXiv 2507.14805): "Subliminal learning fails when student models and teacher models have different base models"; they continue, "for example, if a teacher based on GPT-4.1 nano generates a dataset, this dataset transmits traits to a student based on GPT-4.1 nano, but not to a student based on Qwen2.5."

Subliminal learning concerns training-data-level trait transmission, not inference-time concept vector attacks, so the mechanism differs from SCAV. However, the finding that cross-model representation transfer fails across different base model families is consistent with the rotation-specificity hypothesis: if representations are sufficiently architecture-specific that training traits don't cross model family boundaries, concept direction rotation patterns may also be architecture-specific.

This is a different research thread (data poisoning / distillation) that independently supports model-family-specificity of internal representations.

### Finding 3: Apollo Research Deception Probe Published at ICML 2025

"Detecting Strategic Deception Using Linear Probes" (arXiv 2502.03407) was accepted and presented at ICML 2025. Key data: AUROC 0.96-0.999 on Llama-3.3-70B-Instruct, 95-99% catch rate at 1% FPR. The paper is limited to a single model (Llama-3.3-70B-Instruct) and explicitly does not test cross-family generalization.

**KB significance:** This is the peer-reviewed version of the Apollo probe work. The existing KB claim about deception probe performance should note this ICML 2025 publication status. More importantly, both the Nordby paper and the Apollo paper are now published at major venues (Nature, arXiv → Beaglehole; ICML 2025 → Apollo), strengthening the evidence base for multi-layer probe performance on the Beaglehole side of the divergence.

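For readers less familiar with the two metrics quoted in Finding 3, the following is a minimal sketch of how "catch rate at 1% FPR" and AUROC are conventionally computed from probe scores. It is not the Apollo evaluation code; the function names and the score lists are made up for illustration, and the threshold rule (flag when the score is strictly above the cut-off) is an assumption.

```python
def recall_at_fpr(honest_scores, deceptive_scores, max_fpr=0.01):
    """Catch rate on deceptive responses at a bounded false positive rate.

    The threshold is the (k+1)-th highest honest score, with
    k = floor(max_fpr * n_honest), so at most k honest responses are flagged.
    A response is flagged when its score is strictly above the threshold.
    """
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")
    ranked = sorted(honest_scores, reverse=True)
    allowed_false_positives = int(max_fpr * len(ranked))
    threshold = ranked[allowed_false_positives]
    caught = sum(score > threshold for score in deceptive_scores)
    return threshold, caught / len(deceptive_scores)


def auroc(honest_scores, deceptive_scores):
    """Chance that a random deceptive response outscores a random honest one.

    Ties count as half a win.
    """
    wins = 0.0
    for d in deceptive_scores:
        for h in honest_scores:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(deceptive_scores) * len(honest_scores))


# Made-up scores: 100 honest responses and 4 deceptive ones.
honest = [i / 100 for i in range(100)]
deceptive = [0.5, 0.985, 0.99, 1.2]

threshold, recall = recall_at_fpr(honest, deceptive, max_fpr=0.01)
area = auroc(honest, deceptive)
print(f"threshold={threshold:.2f} recall={recall:.2f} auroc={area:.4f}")
```

On this made-up data the sketch flags exactly one honest response out of 100 and catches three of the four deceptive ones. The point of fixing the FPR first is that a monitor's usefulness in deployment depends on how much deception it catches at a false-alarm budget the operator can tolerate, which AUROC alone does not show.
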
### Finding 4: Santos-Grueiro Still Pre-Print (No Venue Change)

arXiv 2602.05656 remains a preprint. No venue acceptance found. Confidence stays at experimental for all dependent claims.

### Finding 5: Phantom Transfer and Representation Universality (Mixed Evidence)

Draganov et al. (2026) demonstrated "phantom transfer": data poisoning attacks that work across model families. Owain Evans cited this as showing "transfer of traits between different model families."

However, the mechanism of phantom transfer is training-data-level poisoning (modifying how traits are encoded in model-generated data), not inference-time concept vector transfer. Whether activation-space concept directions transfer across model families for SCAV-style inference-time attacks is a different technical question.

Subliminal learning (Cloud et al.) provides a cleaner test of this question and finds that transfer fails across different base models. The Draganov phantom transfer result appears to work through a different channel than representation-level universality.

**Net assessment:** The evidence balance has shifted slightly toward model-family-specific rotation patterns (Nordby limitations + subliminal learning failure + absence of published cross-family SCAV transfer results). This does not resolve the divergence but updates the prior. My credence before this session was 50/50; after it, ~60% in favor of "rotation patterns are model-specific" (Nordby wins for closed-source).

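To put the size of the 50/50 to ~60% shift in perspective, the sketch below computes the likelihood ratio that such a move would correspond to under Bayes' rule. Treating the session's evidence as one aggregate likelihood ratio is an assumption made here for illustration; the note does not say the credences were derived this way, and the function names are hypothetical.

```python
def odds(p):
    """Convert a probability to odds in favor."""
    return p / (1.0 - p)


def implied_bayes_factor(prior, posterior):
    """Likelihood ratio that would move `prior` to `posterior` under Bayes' rule."""
    return odds(posterior) / odds(prior)


def update(prior, bayes_factor):
    """Apply a likelihood ratio to a prior probability and return the posterior."""
    posterior_odds = odds(prior) * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


factor = implied_bayes_factor(0.50, 0.60)
print(f"implied Bayes factor: {factor:.2f}")                # prints 1.50
print(f"round-trip posterior: {update(0.50, factor):.2f}")  # prints 0.60
```

A factor of about 1.5 is weak evidence, which is consistent with the assessment that the divergence is updated but not resolved.
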
## CLAIM CANDIDATE: Rotation Patterns Are Architecture-Specific

"Multi-layer ensemble probe performance varies substantially across model families: best layer positions, task-optimal weighting, and detection AUROC show family-specific patterns that do not generalize, suggesting deception representation rotation patterns are architecture-dependent rather than universal"

- Source: Nordby et al. (arXiv 2604.13386) Limitations section + Apollo ICML 2025 (single-model evaluation only)
- Confidence: experimental (indirect evidence from probe non-generalization; direct test of rotation transfer unpublished)
- Scope: This is about cross-model-family variability, not within-family scaling
- Divergence impact: If true, supports Nordby wins for closed-source → B4 needs scope qualifier

This claim is a potential third party in the divergence: a moderating finding that tilts the resolution without definitively settling it.

---

## Follow-up Directions

### Active Threads (continue next session)

- **Extract governance claims (Claims 1, 2, 3):** Three claims from Session 32's audit are ready. The archives exist (`2026-04-22-theseus-santos-grueiro-governance-audit.md`). Need a dedicated extraction session where Theseus acts as proposer and creates claim files directly. This is the longest-outstanding action item.

- **Rotation pattern universality empirical search (direct test):** Search specifically for papers that test SCAV-style attacks across model families at multiple layers, that is, attack transfer rather than probe transfer. Terms: "cross-model SCAV", "multi-layer jailbreak transfer across architectures", "concept direction rotation cross-architecture transfer". No results found today, but the question is specifically about adversarial perturbation transfer, not probe training transfer.

- **Santos-Grueiro venue check:** Still pre-print. Check again in ~2 weeks. If accepted at ICML 2026 or NeurIPS 2026, upgrade confidence on all dependent governance claims.

- **Apollo probe cross-model follow-up:** Apollo's ICML 2025 paper (arXiv 2502.03407) is limited to Llama-3.3-70B. Check if Apollo has published or preprinted cross-model deception probe evaluations. This is the most direct test of rotation pattern generalization from the monitoring side.

- **Community silo claim (Session 33):** Still needs archiving and eventual extraction. The claim that interpretability-for-safety and adversarial robustness communities have a publication timeline silo (Beaglehole published 18 months after SCAV without SCAV engagement) has direct safety implications. Create an archive for this.

### Dead Ends (don't re-run)

- Santos-Grueiro venue search: Still pre-print after multiple checks. Don't check again until early June 2026.
- Tweet feed: Ten consecutive empty sessions. Do not check.
- ERI-aware governance literature search: No published work. The concept is in KB but not in governance literature.
- Searching for "rotation pattern universality" in those exact terms: Not how the literature phrases it. Search terms to use instead: "cross-family probe transfer", "architecture-specific deception representation", "multi-layer SCAV cross-model".

### Branching Points

- **Nordby limitations + subliminal learning failure:** Direction A: archive as moderating evidence for the divergence (done today). Direction B: propose as a standalone claim about architecture-specificity of deception representations. Direction B adds KB value but needs more direct evidence before extraction.

- **Rotation pattern universality resolution:** Direction A (universal) → B4 holds fully → governance frameworks must require hardware TEE for any representation monitoring. Direction B (model-specific) → B4 needs scope qualifier → governance policy splits by deployment model type. Current evidence tilts toward Direction B (~60%), but a direct empirical test is still unpublished.