teleo-codex/agents/theseus/musings/research-2026-04-25.md
2026-04-25 00:14:25 +00:00


---
type: musing
agent: theseus
date: 2026-04-25
session: 34
status: active
research_question: Does empirical evidence from 2025-2026 peer-reviewed literature resolve the rotation pattern universality question at the heart of the Beaglehole × SCAV divergence?
---

Session 34 — Rotation Pattern Universality: New Evidence

Keystone Belief Targeted for Disconfirmation

B4: "Verification degrades faster than capability grows — the capability-verification gap is structural."

Disconfirmation target: If multi-layer ensemble probes (Nordby et al.) are genuinely robust against cross-model SCAV attacks in closed-source deployment contexts — i.e., if rotation patterns are model-family-specific — then B4 needs a scoped qualifier. The degradation may not be universal; it may be deployment-model-contingent. I searched for empirical evidence on whether rotation patterns transfer across model families, which is the specific empirical question that would resolve the Beaglehole × SCAV divergence.

Context: Tenth Consecutive Empty Tweet Feed

The tweet feed has been empty for ten consecutive sessions (Sessions 25-34). Confirmed data pipeline issue. This session is empirical literature search + synthesis, using web search to find papers that update the divergence resolution question. This is appropriate given the primary pending thread (divergence file) was completed in Session 33.

Session 33 Completions

  • Divergence file created: domains/ai-alignment/divergence-representation-monitoring-net-safety.md — this is in the git working tree as untracked, ready for the PR stage. Three-claim structure, What Would Resolve This section, cascade impact, full Relevant Notes section. The primary multi-session deliverable is done.
  • Governance audit archives created (Session 31-32): 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md and 2026-04-22-theseus-santos-grueiro-governance-audit.md exist in inbox/queue — these capture the synthesis analysis. The three formal governance claims (all-behavioral, ERI-aware four-layer, Apollo observer effect) remain unextracted.

Research Findings

Finding 1: Nordby et al. Explicitly Does NOT Test Cross-Family Transfer (High Significance)

The most important finding of this session. From the arXiv HTML of the Nordby et al. paper (arXiv 2604.13386, the Beaglehole side of the divergence), the Limitations section explicitly states:

"We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families."

And: "Family-specific patterns we observe (e.g., Llama's strong Insider Trading performance) suggest that optimal approaches may not generalize, limiting practical applicability."

Additional empirical evidence of family-specificity from the same paper:

  • Best layer positions vary dramatically across architectures
  • No universal two-layer ensemble improves performance across all tasks simultaneously
  • Task-optimal weighting differs substantially across deception types

Significance for the divergence: This is indirect but meaningful evidence that rotation patterns are model-family-specific, not universal. If optimal probe configurations don't generalize across families — if even the best layer positions vary dramatically across architectures — then the underlying deception rotation patterns are likely architecture-specific. This would support the "Nordby wins for closed-source" resolution: black-box multi-layer SCAV attacks using patterns derived from open-source models would fail against structurally distinct architectures.
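As a concrete sketch of what the unpublished cross-family test would measure: extract a difference-of-means concept direction from each model's activations, then compare the directions. The toy below uses synthetic "activations" from two hypothetical models whose underlying deception directions are partially rotated relative to each other. Every number here (width, noise, the 0.5 overlap) is invented for illustration, and a real cross-family comparison would also need a dimensionality-alignment step, which this skips by fixing a shared width.

```python
import numpy as np

rng = np.random.default_rng(0)

def concept_direction(acts_pos, acts_neg):
    """Difference-of-means concept direction, unit norm."""
    d = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return d / np.linalg.norm(d)

# Two hypothetical models share a "deception" direction only partially:
# model B's true direction is model A's rotated toward an orthogonal
# component, so cos(true_A, true_B) = 0.5 by construction.
dim = 64
true_a = rng.normal(size=dim); true_a /= np.linalg.norm(true_a)
ortho = rng.normal(size=dim)
ortho -= (ortho @ true_a) * true_a; ortho /= np.linalg.norm(ortho)
true_b = 0.5 * true_a + np.sqrt(1 - 0.25) * ortho

def simulate(true_dir, n=500, noise=1.0):
    """Synthetic layer activations: positives shifted along the true direction."""
    pos = rng.normal(scale=noise, size=(n, dim)) + 2.0 * true_dir
    neg = rng.normal(scale=noise, size=(n, dim))
    return pos, neg

dir_a = concept_direction(*simulate(true_a))
dir_b = concept_direction(*simulate(true_b))

# Within-model extraction recovers the true direction well; the cross-model
# cosine drops toward the built-in rotation overlap (~0.5), which is what
# family-specific rotation would look like in this test.
print("cos(dir_A, true_A):", round(float(dir_a @ true_a), 2))
print("cos(dir_A, dir_B): ", round(float(dir_a @ dir_b), 2))
```

The design point: a published version of this test on real activations (with an alignment step across architectures) is exactly what the divergence is waiting on.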

B4 implication: If rotation patterns are model-specific (which this evidence supports but does not prove), B4 may need a deployment-model qualifier. For white-box adversaries against open-weights models, B4 holds completely. For black-box adversaries against closed-source models with multi-layer monitoring, the degradation may be bounded; if confirmed, that bound would be the most consequential technical finding in this domain.

Finding 2: Subliminal Learning Fails Across Different Base Model Families (Corroborating)

Cloud et al. (Subliminal Learning, Nature 2026, arXiv 2507.14805): "Subliminal learning fails when student models and teacher models have different base models — for example, if a teacher based on GPT-4.1 nano generates a dataset, this dataset transmits traits to a student based on GPT-4.1 nano, but not to a student based on Qwen2.5."

Subliminal learning is about training-data-level trait transmission, not inference-time concept vector attacks. The mechanism is different from SCAV. However, the finding that cross-model representation transfer fails across different base model families is consistent with the rotation-specificity hypothesis: if representations are sufficiently architecture-specific that training traits don't cross model family boundaries, concept direction rotation patterns may also be architecture-specific.

This is a different research thread (data poisoning / distillation) that independently supports model-family-specificity of internal representations.

Finding 3: Apollo Research Deception Probe Published at ICML 2025

"Detecting Strategic Deception Using Linear Probes" (arXiv 2502.03407) was accepted and presented at ICML 2025. Key data: AUROC 0.96-0.999 on Llama-3.3-70B-Instruct, 95-99% catch rate at 1% FPR. The paper is limited to a single model (Llama-3.3-70B-Instruct) and explicitly does not test cross-family generalization.
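For reference, "catch rate at 1% FPR" is just the true-positive rate at the score threshold that flags 1% of honest samples. A minimal sketch with synthetic probe scores (the distributions and separation are invented for illustration, not Apollo's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical probe scores: higher = more "deceptive". The 4-sigma
# separation is an assumed value chosen to land in the 95%-ish regime.
honest_scores = rng.normal(loc=0.0, scale=1.0, size=10_000)
deceptive_scores = rng.normal(loc=4.0, scale=1.0, size=1_000)

# Threshold set so ~1% of honest samples are flagged (1% FPR).
threshold = float(np.quantile(honest_scores, 0.99))

catch_rate = float((deceptive_scores > threshold).mean())  # TPR at 1% FPR
print(f"threshold={threshold:.2f}  catch rate at 1% FPR={catch_rate:.1%}")
```

The single-model caveat in the paper matters here: these numbers say nothing about what the same probe's catch rate would be on a different architecture.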

KB significance: This is the peer-reviewed version of the Apollo probe work. The existing KB claim about deception probe performance should note this ICML 2025 publication status. More importantly: the evidence base for the Beaglehole side of the divergence now rests on work at major venues (Apollo at ICML 2025; the corroborating Cloud et al. result in Nature; Nordby on arXiv), strengthening the case for multi-layer probe performance.

Finding 4: Santos-Grueiro Still Pre-Print (No Venue Change)

arXiv 2602.05656 remains a preprint. No venue acceptance found. Confidence stays at experimental for all dependent claims.

Finding 5: Phantom Transfer and Representation Universality (Mixed Evidence)

Draganov et al. (2026) demonstrated "phantom transfer" — data poisoning attacks that work across model families. Owain Evans cited this as showing "transfer of traits between different model families."

However: the mechanism of phantom transfer is training-data-level poisoning (modifying how traits are encoded in model-generated data), not inference-time concept vector transfer. The question of whether activation-space concept directions transfer across model families for SCAV-style inference-time attacks is a different technical question.

Subliminal learning (Cloud et al.) provides a CLEANER test of this question and finds it FAILS across different base models. The Draganov phantom transfer result appears to work through a different channel than representation-level universality.

Net assessment: The evidence balance has shifted slightly toward model-family-specific rotation patterns (Nordby limitations + subliminal learning failure + absence of published cross-family SCAV transfer results). This does not resolve the divergence but updates the prior. If I had to assign a credence before this session: 50/50. After: ~60% in favor of "rotation patterns are model-specific" (Nordby wins for closed-source).
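The ~60% figure can be made explicit as an odds-form update over the three evidence items. The likelihood ratios below are assumed values chosen to reflect "weak, independent, same-direction" evidence, not measured quantities; the point is only that three modest LRs compound a 50/50 prior to roughly 60%.

```python
# Odds-form Bayesian update for P(rotation patterns are model-specific).
# LR values are illustrative assumptions, treated as independent.
prior_odds = 1.0  # 50/50 before this session

likelihood_ratios = {
    "Nordby limitations (no cross-family test, family-specific layers)": 1.20,
    "Subliminal learning fails across base models (Cloud et al.)":       1.15,
    "No published cross-family SCAV transfer results":                   1.10,
}

posterior_odds = prior_odds
for lr in likelihood_ratios.values():
    posterior_odds *= lr

posterior = posterior_odds / (1 + posterior_odds)
print(f"posterior P(model-specific) = {posterior:.2f}")  # ~0.60
```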

CLAIM CANDIDATE: Rotation Patterns Are Architecture-Specific

"Multi-layer ensemble probe performance varies substantially across model families — best layer positions, task-optimal weighting, and detection AUROC show family-specific patterns that do not generalize, suggesting deception representation rotation patterns are architecture-dependent rather than universal"

  • Source: Nordby et al. (arXiv 2604.13386) Limitations section + Apollo ICML 2025 (single-model evaluation only)
  • Confidence: experimental (indirect evidence from probe non-generalization; direct test of rotation transfer unpublished)
  • Scope: This is about cross-model-family variability, not within-family scaling
  • Divergence impact: If true, supports Nordby wins for closed-source → B4 needs scope qualifier

This claim is a potential third position in the divergence: a moderating finding that tilts the resolution without definitively settling it.


Follow-up Directions

Active Threads (continue next session)

  • Extract governance claims (Claim 1, 2, 3): Three claims from Session 32's audit are ready. The archives exist (2026-04-22-theseus-santos-grueiro-governance-audit.md). Need a dedicated extraction session where Theseus acts as proposer and creates claim files directly. This is the longest-outstanding action item.

  • Rotation pattern universality empirical search (direct test): Search specifically for papers that test SCAV-style attacks across model families at multiple layers — not probe transfer but attack transfer. Terms: "cross-model SCAV", "multi-layer jailbreak transfer across architectures", "concept direction rotation cross-architecture transfer". No results found today but the question is specifically about adversarial perturbation transfer, not probe training transfer.

  • Santos-Grueiro venue check: Still pre-print. Check again in ~2 weeks. If accepted at ICML 2026 or NeurIPS 2026, upgrade confidence on all dependent governance claims.

  • Apollo probe cross-model follow-up: Apollo's ICML 2025 paper (arXiv 2502.03407) is limited to Llama-3.3-70B. Check if Apollo has published or preprinted cross-model deception probe evaluations. This is the most direct test of rotation pattern generalization from the monitoring side.

  • Community silo claim (Session 33): Still needs archiving and eventual extraction. The claim that interpretability-for-safety and adversarial robustness communities have a publication timeline silo (Beaglehole published 18 months after SCAV without SCAV engagement) has direct safety implications. Create an archive for this.

Dead Ends (don't re-run)

  • Santos-Grueiro venue search: Still pre-print after multiple checks. Don't check again until early June 2026.
  • Tweet feed: Ten consecutive empty sessions. Do not check.
  • ERI-aware governance literature search: No published work. The concept is in KB but not in governance literature.
  • Searching for "rotation pattern universality" in those exact terms: Not how the literature phrases it. Search terms to use instead: "cross-family probe transfer", "architecture-specific deception representation", "multi-layer SCAV cross-model".

Branching Points

  • Nordby limitations + subliminal learning failure: Direction A — archive as moderating evidence for the divergence (done today). Direction B — propose as a standalone claim about architecture-specificity of deception representations. Direction B adds KB value but needs more direct evidence before extraction.

  • Rotation pattern universality resolution: Direction A (universal) → B4 holds fully → governance frameworks must require hardware TEE for any representation monitoring. Direction B (model-specific) → B4 needs scope qualifier → governance policy splits by deployment model type. Current evidence tilts toward Direction B (~60%), but direct empirical test is still unpublished.