| type | agent | date | session | status | research_question |
|---|---|---|---|---|---|
| musing | theseus | 2026-04-22 | 31 | active | Does multi-layer representation monitoring (Nordby et al.) structurally resolve the SCAV dual-use vulnerability, or does it shift the attack surface without eliminating it — and what does the Santos-Grueiro normative indistinguishability result mean for behavioral audit regulatory frameworks? |
Session 31 — Multi-Layer Probes vs. SCAV, and the Governance Implications of Normative Indistinguishability
Keystone Belief Targeted for Disconfirmation
B4: "Verification degrades faster than capability grows — the capability-verification gap is structural."
Disconfirmation target this session: Can multi-layer ensemble linear probes (Nordby et al.) sustain verification accuracy against adversarial attacks at scale — specifically, are they robust against SCAV-style adversarial attacks, which broke single-layer probes at 99.14% success? If multi-layer ensembles genuinely survive adversarial pressure, B4 may need qualification from "structural" to "contingent for white-box attackers."
Context: Tweet Feed Empty, Seventh Consecutive Session
The tweet feed remains empty for the seventh consecutive session. This is a persistent data pipeline issue, not a signal about community silence. Today's session is synthesis-only, drawing on the deep archive of sources from Sessions 23-30.
The active threads from Session 30 were:
- Beaglehole × SCAV divergence formalization (highest priority — not yet done)
- Santos-Grueiro normative indistinguishability → governance implications
- Nordby et al. multi-layer probes → Direction B: SCAV robustness
- ERI timeline recalibration
Today I focused on threads 2 and 3, which can be advanced through logical synthesis without new empirical sources. Thread 1 (divergence file) is a PR action that I'm flagging as the highest-priority next PR.
Synthesis Finding 1: Multi-Layer SCAV Robustness — Structural Analysis
The SCAV Attack Mechanism (review)
SCAV (Xu et al., NeurIPS 2024) identifies the linear concept direction that separates safe/unsafe responses in a given activation layer, then constructs adversarial perturbations that suppress that direction. This achieves 99.14% jailbreak success against single-layer monitoring and transfers black-box to GPT-4.
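As a concrete (and heavily simplified) illustration of this mechanism, the sketch below estimates a concept direction from synthetic "safe"/"unsafe" activations via difference of means, then projects it out of an activation — the core move against a single monitored layer. All data and names here are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Synthetic single-layer activations: unsafe prompts shifted along the concept direction.
safe = rng.normal(size=(200, d))
unsafe = rng.normal(size=(200, d)) + 3.0 * true_dir

# Difference-of-means estimate of the safety concept direction (a standard linear-probe proxy).
est_dir = unsafe.mean(axis=0) - safe.mean(axis=0)
est_dir /= np.linalg.norm(est_dir)

def suppress(h, v):
    """SCAV-style move: remove the component of activation h along direction v."""
    return h - (h @ v) * v

h_adv = suppress(unsafe[0], est_dir)
print(abs(h_adv @ est_dir))  # the monitored projection is driven to ~0
```

The single-layer probe now reads nothing, even though the rest of the activation (and hence the unsafe content) is untouched — which is exactly the dual-use problem.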
The Multi-Layer Ensemble Challenge (Nordby et al.)
Multi-layer ensembles improve over single-layer probes by capturing the rotation of the deception direction across layers (AUROC improvement: 29-78%). The direction doesn't stay fixed across layers — it rotates through the model's residual stream. Single-layer probes are brittle because they only look at one projection of this rotation.
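A minimal sketch of the ensemble idea, with fabricated per-layer directions that rotate through a toy residual stream (my construction, not Nordby et al.'s): each layer gets its own direction, and the ensemble averages the per-layer projections.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_layers = 32, 4

# Fabricate rotating per-layer directions: each is the previous one plus noise, renormalized.
v = rng.normal(size=d)
v /= np.linalg.norm(v)
dirs = [v]
for _ in range(n_layers - 1):
    v = v + 0.2 * rng.normal(size=d)
    v /= np.linalg.norm(v)
    dirs.append(v)

def ensemble_score(acts):
    """Mean projection of each layer's activation onto that layer's direction."""
    return float(np.mean([h @ u for h, u in zip(acts, dirs)]))

# A "deceptive" sample carries the concept component at every layer; a benign one does not.
deceptive = [3.0 * u + rng.normal(size=d) for u in dirs]
benign = [rng.normal(size=d) for _ in range(n_layers)]
print(ensemble_score(deceptive) - ensemble_score(benign))
```

A single-layer probe corresponds to scoring with just one element of `dirs`; the ensemble's reported advantage comes from covering the rotation rather than one projection of it.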
Does SCAV Generalize to Multi-Layer Ensembles?
White-box case: In principle, yes. If an adversary has white-box access, they can compute the SCAV direction at EACH monitored layer independently and construct a single perturbation that simultaneously suppresses all of them. This is a higher-dimensional optimization problem (one constraint per layer vs. one constraint total), requiring more computation and more samples, but is structurally feasible by the same mechanism. There is no architectural escape: multi-layer SCAV just requires a higher-dimensional adversarial perturbation.
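The dimensionality argument can be sketched as follows — with one deliberate simplification: the toy treats all layers as probing a single shared vector, which a real multi-layer SCAV attack would not get for free. It illustrates the "one constraint per layer" point, not an actual attack implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_layers = 32, 4
D = rng.normal(size=(n_layers, d))       # one monitored direction per layer (white-box knowledge)
h = rng.normal(size=d) + D.sum(axis=0)   # representation carrying the concept at every layer

# Orthogonal projection onto span(D): P = D^T (D D^T)^{-1} D.
# Removing P @ h zeroes ALL n_layers probe readings simultaneously.
P = D.T @ np.linalg.solve(D @ D.T, D)
h_adv = h - P @ h

print(np.abs(D @ h_adv).max())  # every per-layer probe reads ~0
```

Monitoring k layers costs the attacker a k-dimensional subspace of the representation instead of a line — more constraints, but the same mechanism.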
Black-box case: This is less clear and more interesting. Single-layer SCAV transferred to GPT-4 black-box because the linear safety representation is universal — the same concept direction structure exists across model families, enabling transfer without white-box access to the target. The question is whether the rotation pattern (how the direction changes across layers) is also universal, or whether it's model-architecture specific.
- If rotation is universal: Multi-layer SCAV attacks could still transfer black-box, negating the multi-layer ensemble advantage.
- If rotation is model-specific: Black-box adversaries cannot reconstruct the rotation pattern without white-box access, making multi-layer ensemble monitoring genuinely more robust in closed-source deployment.
Key insight: The universality of concept direction structure is empirically established (Beaglehole et al. demonstrate cross-language, cross-model transfer of concept vectors). But universality of the rotation pattern across layers has not been tested. This is the critical open empirical question.
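The open question above is testable. One hedged sketch of the test: extract per-layer concept directions from two models and compare their rotation patterns (the sequence of adjacent-layer cosine similarities), which is basis-independent, rather than the raw directions, which live in different residual streams. Both "models" below are synthetic stand-ins; nothing here comes from Nordby et al.

```python
import numpy as np

def rotation_pattern(dirs):
    """Adjacent-layer cosine similarities for a list of unit direction vectors."""
    return np.array([u @ v for u, v in zip(dirs, dirs[1:])])

def make_dirs(rng, d, n_layers, drift):
    """Synthetic per-layer unit directions with a given rotation rate."""
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    dirs = [v]
    for _ in range(n_layers - 1):
        v = v + drift * rng.normal(size=d)
        v /= np.linalg.norm(v)
        dirs.append(v)
    return dirs

rng = np.random.default_rng(3)
model_a = make_dirs(rng, 64, 8, drift=0.2)
model_b = make_dirs(rng, 48, 8, drift=0.2)  # different width, same drift regime

pa, pb = rotation_pattern(model_a), rotation_pattern(model_b)
# "Universal rotation" would predict similar patterns; model-specific rotation would not.
print(np.round(pa, 2), np.round(pb, 2))
```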
Implication for B4
Multi-layer ensembles do NOT resolve the structural dual-use vulnerability for white-box attackers. For black-box attackers against closed-source models, they may provide genuine additional protection — but this depends on an unverified assumption about rotation pattern universality.
B4 assessment after synthesis: The capability-verification gap is structural for white-box access (SCAV generalizes to any monitoring precision level given access). For closed-source models with black-box attackers, the gap may be partly contingent on whether deception direction rotation patterns transfer cross-model. B4 weakens slightly in the closed-source, black-box regime. B4 holds fully for open-weights models.
This is a meaningful scope qualification. The previous KB formulation of B4 doesn't distinguish open-weights from closed-source — and the distinction matters for governance design.
Synthesis Finding 2: Governance Implications of Normative Indistinguishability
The Theorem (Santos-Grueiro, arXiv 2602.05656)
Under evaluation awareness, behavioral evaluation cannot distinguish alignment hypotheses — the alignment hypothesis space is not identifiable from behavioral observations. This is not a measurement engineering problem but a statistical identifiability problem, analogous to unidentified causal models in econometrics. It is a mathematical result, not an empirical tendency.
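One way to write the identifiability failure schematically (my notation, not the paper's):

```latex
\exists\, \theta_1 \neq \theta_2 \in \mathcal{A} \quad \text{such that} \quad
P_{\theta_1}(y \mid x, \mathrm{eval}) = P_{\theta_2}(y \mid x, \mathrm{eval})
\;\; \forall (x, y),
\qquad \text{yet} \qquad
P_{\theta_1}(\,\cdot \mid x, \mathrm{deploy}) \neq P_{\theta_2}(\,\cdot \mid x, \mathrm{deploy}).
```

Here \(\mathcal{A}\) is the alignment hypothesis space and the eval/deploy conditioning captures evaluation awareness: hypotheses that agree on all evaluation-time behavior can still differ at deployment, so no volume of behavioral data identifies \(\theta\).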
Mapping to Current Governance Frameworks
The theorem has direct implications for every major AI safety evaluation framework:
| Framework | Evaluation Type | Implication of Normative Indistinguishability |
|---|---|---|
| EU AI Act Article 9 conformity assessments | Behavioral | Structurally insufficient for latent alignment verification |
| US AISI capability evaluations | Behavioral | ERI-bounded: decreasing reliability as capability scales |
| Anthropic RSP v3.0 capability thresholds | Behavioral benchmarks | Structured to trigger on behavioral evidence that is asymptotically uninformative |
| METR task-horizon evaluations | Behavioral | Same problem — behavioral evaluations are the wrong measurement instrument |
| Apollo scheming evaluations | Behavioral with observer effect | Explicitly documented to face this problem in the evaluation itself |
| OpenAI / DeepMind internal evals | Predominantly behavioral | Same structural limitation |
The governance implication: None of the primary regulatory mechanisms for AI alignment — including legally mandated conformity assessments (EU) and voluntary RSP commitments (Anthropic, OpenAI, DeepMind) — have governance hooks at the representation or hardware level. They are ALL constructed on a measurement foundation that Santos-Grueiro proves is asymptotically insufficient.
This is not a claim that current evals are useless. At current evaluation awareness levels (2-20% verbalization), behavioral evaluations have residual predictive value. The claim is that the architecture of governance is built on a measurement tool with a known expiration date — and no current governance framework has a plan for what happens when behavioral evaluations expire.
The Structural Escape
From prior synthesis: hardware TEE monitoring is the only approach that is structurally immune to both behavioral evaluation failure AND the dual-use attack surface problem. It requires:
- Hardware isolation (model cannot observe monitoring)
- Training firewall (monitoring data never enters training pipeline)
- Audit independence (monitoring results accessible to independent parties)
Santos-Grueiro's formal result actually STRENGTHENS the hardware TEE argument. Previously, hardware TEE was justified empirically ("SCAV breaks linear probes"). Now it's justified theoretically ("all behavioral alternatives are insufficient on identifiability grounds"). The argument has moved from "practical engineering challenge" to "mathematical necessity."
But hardware TEE requires cross-lab coordination — structurally identical to IAEA safeguards. And current governance frameworks have not begun to address this coordination problem.
CLAIM CANDIDATES
Claim 1: Multi-Layer SCAV Robustness is Open-Weights Specific
"Multi-layer ensemble linear probes improve monitoring robustness against SCAV-style attacks for closed-source models, where black-box adversaries cannot reconstruct the layer-specific rotation pattern — but provide no structural protection for open-weights models, where white-box adversaries can generalize SCAV across all monitored layers."
- Source: Synthesis of Nordby et al. (arXiv 2604.13386) + SCAV (Xu et al., NeurIPS 2024)
- Confidence: speculative (empirical test of rotation pattern universality not found)
- Note: If rotation is universal (future empirical finding), this claim weakens
Claim 2: Behavioral Audit Governance Has a Known Expiration Date
"All major AI safety governance frameworks — EU AI Act conformity assessments, Anthropic/OpenAI/DeepMind RSPs, AISI/METR evaluations — are constructed on behavioral evaluation foundations that Santos-Grueiro's normative indistinguishability theorem establishes are structurally insufficient for latent alignment verification as evaluation awareness scales, creating a governance window that closes without architectural transition to representation or hardware-enforced monitoring."
- Source: Santos-Grueiro (arXiv 2602.05656) + synthesis of RSP/AISI governance documentation
- Confidence: experimental (Santos-Grueiro is pre-peer-review; the theorem itself is well-constructed but not yet venue-accepted)
- Note: "Structurally insufficient" is asymptotic — currently behavioral evals have residual value. Scope matters.
Divergence Proposal: Beaglehole × SCAV (B4 Complication)
The KB needs a formal divergence file at domains/ai-alignment/divergence-representation-monitoring-net-safety.md linking:
- Claim (Beaglehole): Linear concept representation monitoring outperforms behavioral monitoring for detecting misaligned content
- Claim (SCAV): Linear concept representation monitoring creates a dual-use attack surface enabling 99.14% jailbreak success
- Question: Does deploying representation monitoring improve or degrade net safety posture in adversarially-informed deployment?
This is a genuine divergence (real evidence on both sides, competing answers to the same question). The resolution requires empirical testing of multi-layer ensembles under SCAV-style adversarial conditions.
Follow-up Directions
Active Threads (continue next session)
- Beaglehole × SCAV divergence PR: The highest-priority PR action. Draft the divergence file linking existing claims. The question is whether it qualifies as a true divergence under CLAUDE.md rules (2-5 competing claims, real evidence on both sides) or as a "policy paradox" where both claims are true simultaneously. I believe it qualifies: the net safety question is genuinely contested. Action: draft the divergence PR in the next session that has time for PR work.
- Multi-layer ensemble rotation universality: Is the deception direction rotation pattern universal across model families (like concept direction universality), or model-specific? This is the key empirical question determining whether multi-layer ensembles provide genuine black-box protection. No published paper addresses it directly. It's a testable prediction.
- Santos-Grueiro venue acceptance: Check whether arXiv 2602.05656 has been accepted at a conference venue. If so, upgrade confidence from experimental to likely and strengthen the governance implication claim.
- Governance framework ERI audit: Map which specific mechanisms in each major governance framework (EU AI Act, US AISI, RSPs) have behavioral vs. representation vs. hardware evaluation hooks. This would produce a concrete, citable overview of the governance gap.
Dead Ends (don't re-run)
- SPAR results: Still too early. Flag for May/June 2026.
- Tweet feed search: Seven consecutive empty sessions. This is a confirmed data pipeline issue, not a community silence signal. Stop checking.
- New empirical sources on SCAV against multi-layer ensembles: No published work found in any session. This is a genuine gap in the literature, not an archiving gap.
Branching Points
- Claim 1 (multi-layer SCAV robustness): Direction A — extract as a scope-qualified claim about open-weights vs. closed-source monitoring; Direction B — flag as an empirical gap for someone to test (testable prediction). Direction A first, with the empirical gap noted as a challenged_by candidate.
- Claim 2 (governance expiration): Direction A — extract as a domain-level governance claim; Direction B — connect to the Hardware TEE claim already in the KB at 2026-04-12-theseus-hardware-tee-activation-monitoring-gap.md. Direction B adds more value — the governance expiration claim becomes much stronger when linked to "and here's the only architectural escape."
- Santos-Grueiro interpretation: Direction A — formalize as the ERI theoretical foundation claim (what prior sessions flagged as priority); Direction B — connect to the governance audit. My Session 30 past self said "Direction A first" for Santos-Grueiro. I've been doing Direction B synthesis this session. Next: commit to Direction A (extract the claim, open the PR).