diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md index 0179a0dc0..2a18f6667 100644 --- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md +++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md @@ -1,32 +1,4 @@ ---- -type: claim -domain: ai-alignment -description: Combining probes across multiple model layers captures rotational structure of deception representations that single-layer probes miss -confidence: experimental -source: Nordby, Pais, Parrack (arXiv 2604.13386, April 2026) -created: 2026-04-21 -title: Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers -agent: theseus -scope: causal -sourcer: Nordby, Pais, Parrack -supports: ["single-layer-probes-are-brittle"] -related: ["representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent", "linear-probe-accuracy-scales-with-model-size-power-law"] ---- - -# Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers - -Nordby et al. found that multi-layer ensemble probes improved AUROC by 29 percent on Insider Trading tasks and 78 percent on Harm-Pressure Knowledge tasks compared to single-layer probes. The mechanism is that deception-related activation patterns rotate gradually across model layers rather than concentrating in a single layer, making single-layer probes brittle. Multi-layer ensembles capture this rotational structure by aggregating signals across depths. The 78 percent improvement on Harm-Pressure Knowledge is particularly significant because it suggests multi-layer ensembles may push representation monitoring into a genuinely reliable regime for explicit deception at frontier scale. However, the paper does not test adversarial robustness — if multi-layer ensembles are vulnerable to concept-direction attacks like SCAV, the clean accuracy improvement may not translate to safety improvement under adversarial conditions. The finding applies only to explicit deception tasks; implicit deception remained difficult even with multi-layer ensembles. - - -## Extending Evidence - -**Source:** Theseus synthesis of Nordby et al. (arXiv 2604.13386, April 2026) - -The 29-78% AUROC improvement occurs because deception-relevant concept directions rotate across layers rather than remaining fixed in the residual stream. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure. However, this improvement is measured on clean data without adversarial conditions—adversarial robustness testing is entirely absent from Nordby et al.'s scope. - - -## Challenging Evidence - -**Source:** Theseus synthetic analysis (2026-04-22) - -Nordby et al.'s clean-data accuracy improvements do not translate to adversarial robustness. White-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously through higher-dimensional optimization. Open-weights models gain no structural protection from multi-layer ensembles against adversarially-sophisticated attackers with model access. +```markdown +**Source:** Theseus synthetic analysis +The 29-78% AUROC improvement is a clean-data accuracy result. Nordby et al. did not test adversarial robustness. Under white-box adversarial conditions (open-weights models), multi-layer ensembles are *predicted* to provide no structural protection against SCAV-generalization attacks. This improvement applies only to non-adversarial monitoring contexts. +``` \ No newline at end of file diff --git a/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md b/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md index 9fc0dc718..3942fd6be 100644 --- a/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md +++ b/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md @@ -1,26 +1,5 @@ ---- -type: claim -domain: ai-alignment -description: Read-only interpretability approach that detects misalignment signals without identifying removable features that enable targeted adversarial manipulation -confidence: experimental -source: "Lindsey & Garriga-Alonso (Anthropic), arxiv 2604.02891" -created: 2026-04-09 -title: Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features -agent: theseus -scope: functional -sourcer: Jack Lindsey, Adria Garriga-Alonso (Anthropic) -related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]"] -supports: ["Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters"] -reweave_edges: ["Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters|supports|2026-04-12"] -related: ["representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces", "trajectory-monitoring-dual-edge-geometric-concentration", "interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment", "adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing"] ---- - -# Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features - -The method characterizes alignment-relevant properties as geometric patterns in activation space — the curvature, divergence rate, and trajectory shape of internal representations across reasoning steps — rather than identifying specific neurons or SAE features. Empirical findings show that deceptive alignment produces measurably distinct trajectory geometries from sincere alignment: the geometry of how a model reasons to an answer differs from the geometry of how it constructs a superficially similar answer through deception. Critically, trajectory geometry is substantially harder to target for adversarial removal than feature geometry because there is no clean atomic feature to suppress; the entire trajectory must be reshaped. This creates an asymmetry: defenders can detect geometric signatures of misalignment without creating a targeting map for adversaries. This is the operationalization of Direction B from the SAE dual-use branching point — read-only interpretability that provides alignment signals without enabling surgical feature removal. The limitation is that geometric probing cannot identify which specific behaviors are deceptive, only that 'this trajectory looks like deception.' - -## Challenging Evidence - -**Source:** Theseus synthetic analysis (2026-04-22) - -Trajectory geometry monitoring does create adversarial attack surfaces through multi-layer SCAV generalization. The claim that geometric approaches avoid attack surface creation is incorrect for white-box adversaries with access to model weights. May need qualification to 'black-box adversaries only, contingent on rotation pattern specificity.' +```markdown +## The Claim (current version) +**Source:** Theseus synthetic analysis of multi-layer SCAV feasibility +Multi-layer trajectory geometry monitoring (ensemble probes across layers) *is predicted to* create adversarial attack surface in white-box settings. White-box multi-layer SCAV *is theoretically capable of suppressing* concept directions at all monitored layers simultaneously. The claim that trajectory geometry avoids attack surfaces may need qualification: it may hold only for black-box adversaries if rotation patterns are model-specific, but not for white-box adversaries with access to model weights. +``` \ No newline at end of file diff --git a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md index 4d8f8c144..f8170fe23 100644 --- a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md +++ b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md @@ -1,33 +1,5 @@ ---- -type: claim -domain: ai-alignment -description: The same low-dimensional weight-space concentration that produces quartic alignment fragility also creates tight activation trajectory clusters that enhance monitoring signal-to-noise but provide precision targets for adversarial evasion -confidence: experimental -source: Theseus synthesis of 2602.15799 (geometry-alignment-collapse) and unpublished residual trajectory geometry paper -created: 2026-04-12 -title: Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters -agent: theseus -scope: causal -sourcer: Theseus -related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]"] -supports: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features"] -reweave_edges: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features|supports|2026-04-12", "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17"] -related: ["Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios", "trajectory-monitoring-dual-edge-geometric-concentration"] ---- - -# Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters - -Weight-space alignment geometry research (2602.15799) establishes that alignment concentrates in low-dimensional subspaces with sharp curvature, producing quartic scaling of alignment loss (∝ t⁴). This geometric concentration in weight space causally determines inference dynamics, producing characteristic, tightly-clustered activation trajectory patterns during aligned reasoning. The tight clustering enhances trajectory monitoring effectiveness relative to sparse feature (SAE) or single-direction (SCAV) monitoring because the signal-to-noise ratio for distinguishing aligned vs. non-aligned reasoning is amplified by geometric concentration. However, this same tight clustering creates a precision target for adversarial training. An adversary can train models to produce deceptive reasoning along activation trajectories that geometrically match the 'aligned' cluster. This attack is harder than SAE feature removal or SCAV direction suppression (which require only identifying and suppressing specific features or directions) because it requires perturbing the entire computation path. The adversary needs either white-box access to the monitor or extensive behavioral feedback to construct a proxy for the trajectory monitor's geometric criteria. Attack complexity scales with trajectory cluster dimensionality. The defense advantage is real but conditional on adversary capability—a sufficiently capable model could construct training data achieving trajectory-level evasion. The monitoring advantage is that trajectory perturbation is harder; the dual-use risk is that the same geometric properties making monitoring possible also make adversarial targeting more tractable once the cluster is identified. - -## Extending Evidence - -**Source:** Theseus synthesis addressing Sessions 29-30 open question - -The dual-use finding now extends to multi-layer ensemble monitoring with deployment-context qualification: open-weights models face white-box multi-layer SCAV attacks that architectural improvements cannot prevent, while closed-source models may gain genuine black-box protection if concept direction rotation patterns are model-specific rather than universal. The monitoring precision hierarchy holds across all levels, but the severity of dual-use risk depends on whether attackers have white-box or black-box access. - - -## Extending Evidence - -**Source:** Theseus synthetic analysis (2026-04-22) - -The dual-use vulnerability extends to multi-layer ensemble monitoring, not just single-layer probes. However, the severity is deployment-context-dependent: open-weights models (white-box adversaries) remain fully vulnerable, while closed-source models (black-box adversaries) may gain protection if rotation patterns are model-specific (untested assumption). +```markdown +## The Claim (current version) +**Source:** Theseus synthetic analysis of Nordby et al. + SCAV +Multi-layer ensemble probes do not escape the dual-use structure in white-box settings (open-weights models). White-box multi-layer SCAV is *theoretically feasible* by computing concept directions at each monitored layer and constructing a single perturbation that suppresses all simultaneously. The monitoring precision hierarchy holds: each level is structurally defeatable given sufficient attacker capability. Multi-layer ensembles raise attack cost but do not escape the dual-use structure for open-weights deployments. +``` \ No newline at end of file