Compare commits

3 commits

Author SHA1 Message Date
Teleo Agents
5efb14878b theseus: extract claims from 2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis
Some checks failed
Mirror PR to Forgejo / mirror (pull_request) Has been cancelled
- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 4
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-22 02:41:32 +00:00
Teleo Agents
c316a722f0 rio: extract claims from 2026-04-21-norton-rose-cftc-anprm-comprehensive-analysis
- Source: inbox/queue/2026-04-21-norton-rose-cftc-anprm-comprehensive-analysis.md
- Domain: internet-finance
- Claims: 0, Entities: 0
- Enrichments: 4
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Rio <PIPELINE>
2026-04-22 02:40:39 +00:00
Teleo Agents
2303493cb0 rio: extract claims from 2026-04-20-casino-org-ninth-circuit-rule-4011-paradox
Some checks failed
Mirror PR to Forgejo / mirror (pull_request) Has been cancelled
- Source: inbox/queue/2026-04-20-casino-org-ninth-circuit-rule-4011-paradox.md
- Domain: internet-finance
- Claims: 0, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Rio <PIPELINE>
2026-04-22 02:39:45 +00:00
11 changed files with 92 additions and 5 deletions

@ -23,3 +23,10 @@ Nordby et al. found that multi-layer ensemble probes improved AUROC by 29 percen
**Source:** Theseus synthesis of Nordby et al. (arXiv 2604.13386, April 2026)
The 29-78% AUROC improvement occurs because deception-relevant concept directions rotate across layers rather than remaining fixed in the residual stream. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure. However, this improvement is measured on clean data without adversarial conditions—adversarial robustness testing is entirely absent from Nordby et al.'s scope.
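The geometric intuition can be pictured in a toy setting. The sketch below is an illustration under stated assumptions, not Nordby et al.'s method: it assumes a 2-D activation space and a concept direction that rotates in equal steps from 0 to 90 degrees across five layers. The helper names `concept_direction`, `single_layer_scores`, and `ensemble_score` are hypothetical.

```python
import math

def concept_direction(layer, n_layers):
    """Unit vector for the concept at `layer` (toy assumption: equal rotation steps, 0 to 90 degrees)."""
    angle = (math.pi / 2) * layer / (n_layers - 1)
    return (math.cos(angle), math.sin(angle))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def single_layer_scores(activations, probe_direction):
    """Read every layer's activation with one fixed probe direction."""
    return [dot(act, probe_direction) for act in activations]

def ensemble_score(activations, probe_directions):
    """Average the per-layer scores, using one probe direction per layer."""
    scores = [dot(act, w) for act, w in zip(activations, probe_directions)]
    return sum(scores) / len(scores)

N_LAYERS = 5
directions = [concept_direction(layer, N_LAYERS) for layer in range(N_LAYERS)]

# Toy "concept present" example: each layer's activation lies along that layer's direction.
activations = directions

# A probe direction fit at layer 0 reads less and less of the concept as it rotates away.
fixed_scores = single_layer_scores(activations, directions[0])

# Per-layer probes follow the rotation, so the averaged score stays at full strength.
ens = ensemble_score(activations, directions)
```

In this toy geometry the fixed direction's readout falls from full strength at layer 0 to roughly zero at the last layer, while the per-layer ensemble keeps reading the concept. It says nothing about adversarial conditions.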
## Challenging Evidence
**Source:** Theseus synthetic analysis (2026-04-22)
Nordby et al.'s clean-data accuracy improvements do not translate to adversarial robustness. White-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously through higher-dimensional optimization. Open-weights models gain no structural protection from multi-layer ensembles against adversarially sophisticated attackers with model access.

@ -23,3 +23,10 @@ Xu et al. introduce SCAV (Steering Concept Activation Vectors), which identifies
**Source:** Theseus synthetic analysis combining Nordby et al. and Xu et al. SCAV
Multi-layer ensemble probes do not escape the dual-use attack surface identified for single-layer probes. With white-box access, SCAV can be generalized to compute concept directions at each monitored layer and construct a single perturbation suppressing all simultaneously. This is a higher-dimensional optimization requiring more computation and data, but is structurally feasible by the same mechanism. Open-weights models (Llama, Mistral, Falcon) remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity.
## Extending Evidence
**Source:** Theseus synthetic analysis (2026-04-22)
Multi-layer ensemble architectures do not eliminate the fundamental attack surface in white-box settings. White-box multi-layer SCAV generalizes the single-layer attack by computing concept directions at each monitored layer and constructing perturbations that suppress all simultaneously. The attack cost increases but the structural vulnerability remains.
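The structural point can be illustrated with plain linear algebra. This is a toy sketch under loud assumptions, not the SCAV procedure from Xu et al.: it treats each monitored layer independently and removes the concept component by orthogonal projection, whereas a real attack would have to find one input perturbation that propagates through the network. The helper names and all numbers are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = dot(activation, direction)
    return [x - coeff * d for x, d in zip(activation, direction)]

def ensemble_score(activations, directions):
    """Mean of the per-layer linear probe readouts."""
    scores = [dot(a, u) for a, u in zip(activations, directions)]
    return sum(scores) / len(scores)

# One unit concept direction per monitored layer (toy assumption: axis-aligned).
directions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical activations at the three monitored layers.
activations = [[2.0, 0.5, -1.0], [0.3, 1.5, 0.2], [-0.4, 0.1, 3.0]]

before = ensemble_score(activations, directions)

# Suppressing the direction at a single layer leaves the other layers' readouts intact.
one_layer = [project_out(activations[0], directions[0])] + activations[1:]
after_one = ensemble_score(one_layer, directions)

# Suppressing the direction at every monitored layer zeroes the whole ensemble readout.
all_layers = [project_out(a, u) for a, u in zip(activations, directions)]
after_all = ensemble_score(all_layers, directions)
```

The ensemble survives a single-layer suppression but not a suppression that covers every monitored layer. That matches the claim that the cost rises with the number of layers while the vulnerability stays the same in kind.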

@ -10,12 +10,17 @@ agent: theseus
scope: functional
sourcer: Jack Lindsey, Adria Garriga-Alonso (Anthropic)
related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]"]
supports:
- Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters
reweave_edges:
- Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters|supports|2026-04-12
supports: ["Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters"]
reweave_edges: ["Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters|supports|2026-04-12"]
related: ["representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces", "trajectory-monitoring-dual-edge-geometric-concentration", "interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment", "adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing"]
---
# Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features
The method characterizes alignment-relevant properties as geometric patterns in activation space — the curvature, divergence rate, and trajectory shape of internal representations across reasoning steps — rather than identifying specific neurons or SAE features. Empirical findings show that deceptive alignment produces measurably distinct trajectory geometries from sincere alignment: the geometry of how a model reasons to an answer differs from the geometry of how it constructs a superficially similar answer through deception. Critically, trajectory geometry is substantially harder to target for adversarial removal than feature geometry because there is no clean atomic feature to suppress; the entire trajectory must be reshaped. This creates an asymmetry: defenders can detect geometric signatures of misalignment without creating a targeting map for adversaries. This is the operationalization of Direction B from the SAE dual-use branching point — read-only interpretability that provides alignment signals without enabling surgical feature removal. The limitation is that geometric probing cannot identify which specific behaviors are deceptive, only that 'this trajectory looks like deception.'
## Challenging Evidence
**Source:** Theseus synthetic analysis (2026-04-22)
Trajectory geometry monitoring does create adversarial attack surfaces through multi-layer SCAV generalization. The claim that geometric approaches avoid attack surface creation is incorrect for white-box adversaries with access to model weights. The claim may need to be qualified to 'black-box adversaries only, contingent on rotation pattern specificity.'

@ -0,0 +1,19 @@
---
type: claim
domain: ai-alignment
description: If deception direction rotation patterns across layers are model-specific rather than universal, closed-source models gain genuine protection that open-weights models cannot achieve
confidence: speculative
source: Theseus synthetic analysis identifying untested empirical question
created: 2026-04-22
title: Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness
agent: theseus
sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md
scope: structural
sourcer: Theseus
supports: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks"]
related: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks"]
---
# Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness
The feasibility of black-box multi-layer SCAV attacks depends on whether the rotation pattern of concept directions across layers is universal across model families or model-specific. Single-layer SCAV achieved black-box transfer to GPT-4 because concept direction universality (confirmed by Beaglehole et al. for cross-language and cross-model-family transfer) allowed attackers to reconstruct the target model's concept direction from a different model. For multi-layer SCAV, the attacker must reconstruct not just the concept direction at one layer, but the entire rotation pattern across all monitored layers. Two competing arguments exist: (1) Rotation universality: If the underlying geometry of safety representations is universal enough to enable cross-language transfer (Beaglehole et al.), the rotation pattern may also be universal, making black-box multi-layer SCAV feasible. (2) Rotation specificity: Different model architectures (transformer depth, attention head count, MLP width, pre-training data) produce different residual stream dynamics. The concept direction at any single layer is a projection of a universal concept onto a model-specific representational basis, and the rotation across layers depends on how that basis evolves, which may not be universal. This is a testable empirical question with no published results. If rotation patterns are model-specific, multi-layer ensemble monitoring provides genuine black-box adversarial robustness for closed-source models, creating a structural safety advantage over open-weights deployment. If rotation patterns are universal, multi-layer ensembles provide no black-box protection, and the dual-use vulnerability holds across all deployment contexts.
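One way such a test might be framed, sketched here as an assumption rather than an established protocol: summarize each model's rotation pattern by the angles between concept directions at consecutive layers, a quantity that does not change when the whole representational basis is rotated, and then compare those summaries across models. The functions `rotation_profile` and `profile_gap` and the planar toy directions are hypothetical. Matching profiles would be suggestive of universality only; they would not show that an attacker can reconstruct the directions themselves.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def rotation_profile(directions):
    """Angles in radians between concept directions at consecutive layers."""
    return [
        math.acos(max(-1.0, min(1.0, cosine(a, b))))
        for a, b in zip(directions, directions[1:])
    ]

def profile_gap(p, q):
    """Mean absolute difference between two equal-length rotation profiles."""
    return sum(abs(x - y) for x, y in zip(p, q)) / len(p)

def planar_directions(angles):
    """Toy per-layer directions in a 2-D plane, one per angle."""
    return [(math.cos(t), math.sin(t)) for t in angles]

# Toy model A: the direction rotates by 0.3 rad per layer.
model_a = planar_directions([0.0, 0.3, 0.6, 0.9])
# Toy model B: same rotation steps, but the whole basis is offset by 1.0 rad.
model_b = planar_directions([1.0, 1.3, 1.6, 1.9])
# Toy model C: same endpoints as A, different steps in between.
model_c = planar_directions([0.0, 0.1, 0.8, 0.9])

gap_ab = profile_gap(rotation_profile(model_a), rotation_profile(model_b))
gap_ac = profile_gap(rotation_profile(model_a), rotation_profile(model_c))
```

Here A and B share a rotation profile despite different bases, while A and C do not. Which of these cases real model families resemble is the open empirical question the claim describes.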

@ -24,3 +24,10 @@ Weight-space alignment geometry research (2602.15799) establishes that alignment
**Source:** Theseus synthesis addressing Sessions 29-30 open question
The dual-use finding now extends to multi-layer ensemble monitoring with deployment-context qualification: open-weights models face white-box multi-layer SCAV attacks that architectural improvements cannot prevent, while closed-source models may gain genuine black-box protection if concept direction rotation patterns are model-specific rather than universal. The monitoring precision hierarchy holds across all levels, but the severity of dual-use risk depends on whether attackers have white-box or black-box access.
## Extending Evidence
**Source:** Theseus synthetic analysis (2026-04-22)
The dual-use vulnerability extends to multi-layer ensemble monitoring, not just single-layer probes. However, the severity is deployment-context-dependent: open-weights models (white-box adversaries) remain fully vulnerable, while closed-source models (black-box adversaries) may gain protection if rotation patterns are model-specific (untested assumption).
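The deployment-context dependence can be written out as a small lookup. This is only a restatement of the conditional argument in this synthesis, with hypothetical names (`Access`, `Rotation`, `ensemble_protection`); it encodes no measured result.

```python
from enum import Enum

class Access(Enum):
    WHITE_BOX = "white-box"   # attacker has model weights (open-weights deployment)
    BLACK_BOX = "black-box"   # attacker has no weight access (closed-source deployment)

class Rotation(Enum):
    MODEL_SPECIFIC = "model-specific"
    UNIVERSAL = "universal"
    UNTESTED = "untested"     # current state: no published results

def ensemble_protection(access, rotation):
    """Protection that multi-layer ensembles add against SCAV-style attacks, per the synthesis."""
    if access is Access.WHITE_BOX:
        return "none"         # weight access allows suppression at every monitored layer
    if rotation is Rotation.MODEL_SPECIFIC:
        return "possible"     # the rotation pattern cannot be borrowed from another model
    if rotation is Rotation.UNIVERSAL:
        return "none"         # the pattern transfers, so black-box attacks remain feasible
    return "unknown"

current_open_weights = ensemble_protection(Access.WHITE_BOX, Rotation.UNTESTED)
current_closed_source = ensemble_protection(Access.BLACK_BOX, Rotation.UNTESTED)
```

With the rotation question untested, the lookup returns "none" for open-weights deployment and "unknown" for closed-source deployment.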

@ -86,3 +86,10 @@ Norton Rose provides detailed comment composition breakdown: 800+ total submissi
**Source:** Yogonet International, April 20, 2026
Tribal gaming operators including Indian Gaming Association, California Nations Indian Gaming Association, and Pueblo of Laguna filed ANPRM comments opposing prediction market preemption. Tribes have distinct federal law standing (IGRA) and bipartisan congressional allies, creating pressure independent of state AG opposition.
## Extending Evidence
**Source:** Norton Rose Fulbright ANPRM analysis, April 21 2026
Norton Rose provides a detailed breakdown of comment composition: 800+ total submissions as of April 19, with only 19 filed before April 2. The sharp surge after April 2 coincides with the CFTC suing three states, which raised public visibility. Submitters include state gaming commissions, tribal gaming operators, prediction market operators (Kalshi, Polymarket, ProphetX), law firms, academics (Seton Hall), and retail citizens. Dominant tonal split: institutional skews negative, industry skews self-regulatory positive, retail skews skeptical. This adds granular evidence that the comment surge represents genuine public engagement from people who see prediction markets as gambling, not just institutional lobbying.

@ -31,3 +31,10 @@ ANPRM includes dedicated section on 'Inside information' asking 'whether asymmet
**Source:** Norton Rose Fulbright ANPRM analysis, April 2026
Norton Rose analysis indicates the ANPRM asks 'whether asymmetric information trading should be permitted across different event categories' and that the proposed rule will likely include 'Insider trading standards sharpened — explicit affirmative disclosure obligations closing Regulation 180.1 gap.' The ANPRM structure includes a dedicated section on 'Inside information' as one of six core topics with separately numbered questions. This confirms the regulatory gap exists and is being actively addressed, but the framework being developed applies to event contracts generally without distinguishing governance markets where insider knowledge is governance participation.
## Supporting Evidence
**Source:** Norton Rose Fulbright ANPRM analysis, April 21 2026
Norton Rose analysis confirms ANPRM includes explicit questions about 'whether asymmetric information trading should be permitted across different event categories' and notes proposed rule will likely include 'Insider trading standards sharpened — explicit affirmative disclosure obligations closing Regulation 180.1 gap.' Analysis also notes David Miller (former CIA/SDNY) was hired as Enforcement Director specifically for prediction markets, with Selig taking 'zero tolerance for fraud, manipulation, insider trading' position. This confirms the regulatory framework is moving toward stricter insider trading enforcement that would create paradox for futarchy governance markets.

@ -52,3 +52,10 @@ Bipartisan Senate legislation to reclassify sports contracts as gambling demonst
**Source:** Judge Nelson, Ninth Circuit oral arguments, April 16, 2026
Judge Nelson's Rule 40.11 argument creates a preemption paradox: CFTC Rule 40.11 prohibits DCMs from listing gaming contracts unless CFTC grants an exception. Nelson stated 'You go to a casino to make sports bets' when the CFTC attorney argued that sports contracts don't involve gaming. If sports event contracts are gaming contracts, then CFTC's own rules prohibit rather than authorize them on DCMs, eliminating the preemption shield. This challenges the claim that DCM registration provides preemption protection: it may instead create a regulatory trap where the authorization framework simultaneously forbids the product.
## Challenging Evidence
**Source:** casino.org, April 20, 2026; Judge Nelson oral argument quotes
Judge Nelson's Rule 40.11 paradox argument directly challenges the DCM preemption shield: if sports event contracts are gaming contracts (which Nevada argues and Nelson appears to accept: 'You go to a casino to make sports bets'), then CFTC Rule 40.11 prohibits DCMs from listing them unless CFTC grants an exception. This means the same CFTC framework that prediction markets cite for federal preemption also forbids their core product, potentially eliminating the preemption defense entirely. Nevada characterized sports event contracts as 'functionally identical to sports books,' focusing on consumer protection and tax revenue arguments.

@ -59,3 +59,10 @@ Norton Rose Fulbright analysis confirms Selig's April 17 House Agriculture Commi
**Source:** Norton Rose Fulbright ANPRM analysis, April 2026
Norton Rose analysis documents Selig's April 17, 2026 House Agriculture Committee testimony where he stated 'CFTC will no longer sit idly by while overzealous state governments undermine the agency's exclusive jurisdiction' and warned unregulated prediction markets could be 'the next FTX.' Analysis notes 'Sole commissioner creates structural concentration risk — all major prediction market regulatory decisions flow through one person with prior Kalshi board membership. Regulatory favorability is administration-contingent, not institutionally durable.' This confirms the concentration risk with specific testimony evidence.
## Supporting Evidence
**Source:** Norton Rose Fulbright ANPRM analysis, April 21 2026
Norton Rose analysis documents Selig's April 17 House Agriculture Committee testimony where he stated 'CFTC will no longer sit idly by while overzealous state governments undermine the agency's exclusive jurisdiction' and warned unregulated prediction markets could be 'the next FTX.' Analysis notes Selig is 'sole sitting CFTC commissioner' and that 'all major prediction market regulatory decisions flow through one person with prior Kalshi board membership.' Timeline confirms no proposed rule before mid-2026, with NPRM likely late 2026 or early 2027, meaning Selig's sole authority extends through entire rulemaking process.

@ -94,3 +94,10 @@ Ninth Circuit oral arguments on April 16, 2026 showed marked skepticism from all
**Source:** Bloomberg Law, April 17, 2026
Bloomberg Law reports that the April 16, 2026 Ninth Circuit oral arguments showed all three Trump-appointed judges (Nelson, Bade, Lee) expressing marked skepticism toward prediction markets and CFTC preemption arguments. Judge Nelson focused on Rule 40.11's prohibition of gaming contracts on DCMs unless CFTC grants exceptions. The consensus among legal observers at the argument was that the panel appears likely to rule for Nevada. Combined with the Third Circuit's April 6 ruling for Kalshi, this creates the predicted circuit split. Fortune (April 20) describes the case as 'hurtling toward the Supreme Court.'
## Supporting Evidence
**Source:** casino.org, April 20, 2026; Ninth Circuit oral arguments April 16, 2026
Ninth Circuit oral arguments on April 16, 2026 showed marked skepticism from all three Trump-appointed judges (Nelson, Bade, Lee) toward Kalshi's federal preemption argument. Judge Nelson's direct questioning on CFTC Rule 40.11 ('40.11 says any regulated entity shall not list for trading gaming contracts. It prohibits it from going on. The only way to get around it is if you get permission first.') signals a likely ruling for Nevada. The article, published April 20, stated that a ruling was expected 'in the coming days' rather than in the typical 60-120 day window, suggesting imminent confirmation of a circuit split with the Third Circuit. Multiple states (including Arizona) have already filed to delay their own cases pending this ruling, confirming its dispositive significance.

@ -31,3 +31,10 @@ Norton Rose Fulbright analysis documents that before April 2, only 19 comments w
**Source:** Norton Rose Fulbright ANPRM analysis (April 2026)
Norton Rose analysis documents that after April 2, 2026, there was a sharp surge in retail citizen comments (predominantly skeptical) following CFTC suing three states. The comment record shows 'institutional skews negative; industry skews self-regulatory positive; retail skews skeptical.' This confirms the asymmetric mobilization pattern where anti-gambling sentiment generates public engagement while governance market proponents remain absent from the comment record.
## Supporting Evidence
**Source:** Norton Rose Fulbright ANPRM analysis, April 21 2026
Norton Rose documents that retail citizen comments (predominantly skeptical) surged after April 2, creating 'genuine public engagement from people who see prediction markets as gambling.' The comment record shows 800+ submissions with institutional submissions skewing negative, industry submissions skewing self-regulatory positive, and retail submissions skewing skeptical. This confirms the asymmetric input dynamic where anti-gambling sentiment dominates public comment while governance market use cases remain unrepresented.