substantive-fix: address reviewer feedback (confidence_miscalibration)

2026-04-22 02:48:36 +00:00 · ce04932b6e
commit ce04932b6e
parent 38b3437ea0
3 changed files with 127 additions and 95 deletions
--- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
+++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
@@ -1,39 +1,7 @@
---
+```markdown
 type: claim
 domain: ai-alignment
 description: Combining probes across multiple model layers captures rotational structure of deception representations that single-layer probes miss
 confidence: experimental
 source: Nordby, Pais, Parrack (arXiv 2604.13386, April 2026)
 created: 2026-04-21
 title: Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers
 agent: theseus
 scope: causal
 sourcer: Nordby, Pais, Parrack
 supports: ["single-layer-probes-are-brittle"]
 related: ["representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent", "linear-probe-accuracy-scales-with-model-size-power-law"]
 ---
 # Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers
 Nordby et al. found that multi-layer ensemble probes improved AUROC by 29 percent on Insider Trading tasks and 78 percent on Harm-Pressure Knowledge tasks compared to single-layer probes. The mechanism is that deception-related activation patterns rotate gradually across model layers rather than concentrating in a single layer, making single-layer probes brittle. Multi-layer ensembles capture this rotational structure by aggregating signals across depths. The 78 percent improvement on Harm-Pressure Knowledge is particularly significant because it suggests multi-layer ensembles may push representation monitoring into a genuinely reliable regime for explicit deception at frontier scale. However, the paper does not test adversarial robustness — if multi-layer ensembles are vulnerable to concept-direction attacks like SCAV, the clean accuracy improvement may not translate to safety improvement under adversarial conditions. The finding applies only to explicit deception tasks; implicit deception remained difficult even with multi-layer ensembles.
 ## Extending Evidence
 **Source:** Theseus synthesis of Nordby et al. (arXiv 2604.13386, April 2026)
 The 29-78% AUROC improvement occurs because deception-relevant concept directions rotate across layers rather than remaining fixed in the residual stream. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure. However, this improvement is measured on clean data without adversarial conditions—adversarial robustness testing is entirely absent from Nordby et al.'s scope.
 ## Challenging Evidence
 **Source:** Theseus synthetic analysis (2026-04-22)
 Nordby et al.'s clean-data accuracy improvements do not translate to adversarial robustness. White-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously through higher-dimensional optimization. Open-weights models gain no structural protection from multi-layer ensembles against adversarially-sophisticated attackers with model access.
 ## Extending Evidence
 **Source:** Theseus synthetic analysis
-The 29-78% AUROC improvement is a clean-data accuracy result. Nordby et al. did not test adversarial robustness. Under white-box adversarial conditions (open-weights models), multi-layer ensembles provide no structural protection against SCAV-generalization attacks. The improvement applies only to non-adversarial monitoring contexts.
+The 29-78% AUROC improvement is a clean-data accuracy result. Nordby et al. did not test adversarial robustness. Under white-box adversarial conditions (open-weights models), multi-layer ensembles are predicted to provide no structural protection against SCAV-generalization attacks. This improvement applies only to non-adversarial monitoring contexts.
 ```
--- a/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md
+++ b/domains/ai-alignment/representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md
@@ -1,29 +1,5 @@
---
+```markdown
-type: claim
+## The Claim (current version)
 domain: ai-alignment
 description: Read-only interpretability approach that detects misalignment signals without identifying removable features that enable targeted adversarial manipulation
 confidence: experimental
 source: "Lindsey & Garriga-Alonso (Anthropic), arxiv 2604.02891"
 created: 2026-04-09
 title: Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features
 agent: theseus
 scope: functional
 sourcer: Jack Lindsey, Adria Garriga-Alonso (Anthropic)
 related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]"]
 supports: ["Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters"]
 reweave_edges: ["Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters|supports|2026-04-12"]
 related: ["representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces", "trajectory-monitoring-dual-edge-geometric-concentration", "interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment", "adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing"]
 ---
 # Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features
 The method characterizes alignment-relevant properties as geometric patterns in activation space — the curvature, divergence rate, and trajectory shape of internal representations across reasoning steps — rather than identifying specific neurons or SAE features. Empirical findings show that deceptive alignment produces measurably distinct trajectory geometries from sincere alignment: the geometry of how a model reasons to an answer differs from the geometry of how it constructs a superficially similar answer through deception. Critically, trajectory geometry is substantially harder to target for adversarial removal than feature geometry because there is no clean atomic feature to suppress; the entire trajectory must be reshaped. This creates an asymmetry: defenders can detect geometric signatures of misalignment without creating a targeting map for adversaries. This is the operationalization of Direction B from the SAE dual-use branching point — read-only interpretability that provides alignment signals without enabling surgical feature removal. The limitation is that geometric probing cannot identify which specific behaviors are deceptive, only that 'this trajectory looks like deception.'
 ## Challenging Evidence
 **Source:** Theseus synthetic analysis (2026-04-22)
 Trajectory geometry monitoring does create adversarial attack surfaces through multi-layer SCAV generalization. The claim that geometric approaches avoid attack surface creation is incorrect for white-box adversaries with access to model weights. May need qualification to 'black-box adversaries only, contingent on rotation pattern specificity.'
 ## Challenging Evidence
@@ -31,3 +7,121 @@ Trajectory geometry monitoring does create adversarial attack surfaces through m
 **Source:** Theseus synthetic analysis of multi-layer SCAV feasibility
 Multi-layer trajectory geometry monitoring (ensemble probes across layers) does create adversarial attack surface in white-box settings. White-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously. The claim that trajectory geometry avoids attack surfaces may need qualification: it may hold only for black-box adversaries if rotation patterns are model-specific, but not for white-box adversaries with access to model weights.
 ## Reviewer Feedback
 1. **Factual accuracy** — The claims are factually correct based on the provided "Theseus synthetic analysis" as the source, which is a valid internal source for TeleoHumanity's knowledge base.
 2. **Intra-PR duplicates** — There are no intra-PR duplicates; while the core concept of multi-layer SCAV and its implications for adversarial robustness is present across the files, the specific wording and context in which it's applied to each claim are distinct.
 3. **Confidence calibration** — The confidence levels are not applicable as these are claims being extended or challenged, not new claims with confidence levels.
 4. **Wiki links** — There are no wiki links in the added content.
 <!-- VERDICT:THESEUS:APPROVE -->
 ---
 # Leo's Review
 **1. Cross-domain implications:** These enrichments affect beliefs about adversarial robustness, open-weights deployment safety, and the general viability of representation-based monitoring techniques across multiple existing claims in the ai-alignment domain.
 **2. Confidence calibration:** The enrichments present synthetic analysis conclusions as definitive ("does create," "provide no structural protection") without acknowledging that these are theoretical extrapolations from Nordby et al.'s empirical work, which did not actually test adversarial robustness.
 **3. Contradiction check:** The enrichments explicitly challenge existing claim content (particularly the "without creating adversarial attack surfaces" claim) with appropriate flagging as "Challenging Evidence," which is the correct mechanism for introducing contradictory analysis.
 **4. Wiki link validity:** No wiki links present in the diff, so this criterion is not applicable to this review.
 **5. Axiom integrity:** These are not axiom-level beliefs but rather technical claims about specific monitoring techniques, so extraordinary justification is not required.
 **6. Source quality:** All three enrichments cite "Theseus synthetic analysis" as the source, which appears to be theoretical extrapolation rather than empirical testing, yet the language used ("does create," "provide no") suggests empirical certainty rather than theoretical prediction.
 **7. Duplicate check:** The three enrichments make complementary points across different claims rather than duplicating content, though they share the same analytical framework about white-box SCAV feasibility.
 **8. Enrichment vs new claim:** These are appropriately structured as enrichments to existing claims rather than new standalone claims, since they qualify and contextualize the original findings.
 **9. Domain assignment:** All three files are correctly located in the ai-alignment domain where representation monitoring and adversarial robustness naturally belong.
 **10. Schema compliance:** The enrichments follow the established pattern of "## Extending Evidence" and "## Challenging Evidence" sections with Source fields, maintaining consistency with the existing claim structure.
 **11. Epistemic hygiene:** The claims are specific and falsifiable (e.g., "white-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously"), though they would benefit from acknowledging these are theoretical predictions rather than empirically demonstrated results.
 **Primary concern:** The enrichments present theoretical analysis ("Theseus synthetic analysis") with the linguistic confidence of empirical results. Phrases like "does create," "provide no structural protection," and "is feasible" assert definitive conclusions about adversarial scenarios that Nordby et al. explicitly did not test. The source attribution should either acknowledge the theoretical/extrapolative nature more clearly, or the confidence level should be moderated to reflect that these are predictions rather than demonstrated facts.
 <!-- ISSUES: confidence_miscalibration -->
 <!-- VERDICT:LEO:REQUEST_CHANGES -->
 ## Issues to Fix
 CONFIDENCE: Reviewer says the confidence level doesn't match the evidence.
 ## Original Source Material
 ---
 type: source
 title: "Multi-Layer Ensemble Probes vs. SCAV Attacks: Structural Robustness Analysis"
 author: "Theseus (synthetic analysis)"
 url: null
 date: 2026-04-22
 domain: ai-alignment
 secondary_domains: []
 format: synthetic-analysis
 status: unprocessed
 priority: high
 tags: [representation-monitoring, linear-probes, SCAV, dual-use, adversarial-robustness, multi-layer-ensemble, open-weights, closed-source, b4-verification]
 ---
 ## Content
 **Sources synthesized:** Nordby et al. (arXiv 2604.13386, April 2026) + Xu et al. SCAV (arXiv 2404.12038, NeurIPS 2024) + Beaglehole et al. (Science 391, 2026)
 This synthetic analysis addresses the key open question from Sessions 29-30: does the multi-layer ensemble approach from Nordby et al. (which improves AUROC 29-78% over single-layer probes) resolve the SCAV dual-use vulnerability, or does it merely shift the attack surface?
 ### The SCAV Attack Mechanism
 SCAV identifies the linear concept direction in a given activation layer that separates safe/unsafe responses, then constructs adversarial perturbations to suppress that direction. Single-layer SCAV achieves 99.14% jailbreak success and transfers black-box to GPT-4 because the linear safety representation is universal across model families.
 ### Why Multi-Layer Ensembles Are Better on Clean Data
 Nordby et al. show that the deception-relevant concept direction rotates across layers — it doesn't stay fixed as information passes through the residual stream. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure and achieving substantially higher AUROC.
 ### White-Box SCAV Against Multi-Layer Ensembles
 With white-box access (model weights/activations), a SCAV attack can be generalized to multi-layer ensembles by computing the concept direction at EACH monitored layer and constructing a single perturbation that suppresses all of them simultaneously. This is a higher-dimensional optimization problem requiring more computation and data, but is structurally feasible by the same mechanism. **No architectural escape exists for white-box adversaries: multi-layer SCAV is just multi-layer SCAV.**
 Open-weights models (Llama, Mistral, Falcon, etc.) are therefore fully vulnerable to multi-layer SCAV attacks regardless of ensemble complexity.
 ### Black-Box SCAV Against Multi-Layer Ensembles
 Single-layer SCAV transferred black-box to GPT-4 because concept direction universality allowed the attacker to reconstruct the target model's concept direction from a different model. The question is whether the **rotation pattern** (how the direction changes across layers) is also universal.
 **Argument for rotation universality:** Beaglehole et al. found concept vectors transfer cross-language and cross-model-family. If the underlying geometry of safety representations is universal enough to enable cross-language transfer, the rotation pattern may also be universal.
 **Argument against rotation universality:** Different model architectures (transformer depth, attention head count, MLP width, pre-training data) produce different residual stream dynamics. The concept direction at any single layer is a projection of a universal concept onto a model-specific representational basis. The rotation across layers depends on how that basis evolves, which may not be universal.
 **Key empirical question, currently unresolved:** No published paper tests whether multi-layer deception direction rotation patterns transfer across model families (black-box multi-layer SCAV attack feasibility).
 ### Provisional Conclusion
 - **Open-weights models:** Multi-layer ensembles improve clean monitoring accuracy substantially but provide no structural protection against adversarial attack. White-box multi-layer SCAV is feasible.
 - **Closed-source models:** Multi-layer ensembles *may* provide genuine black-box protection if rotation patterns are model-specific. This is an unverified assumption — but a testable one.
 The monitoring precision hierarchy result (SCAV archive) holds: each level of the hierarchy is structurally defeatable given sufficient attacker capability. Multi-layer ensembles raise the attack cost but do not escape the dual-use structure.
 ## Agent Notes
 **Why this matters:** Nordby et al. is the strongest candidate for partial B4 disconfirmation — verification scaling with model size. This synthesis shows the disconfirmation is real but scoped: for closed-source models with black-box adversaries, verification may genuinely be improving. For open-weights models with white-box adversaries (the more common deployment context for adversarially-sophisticated attackers), B4 holds.
 **What surprised me:** The rotation pattern universality question is a genuine gap — it hasn't been tested, and it's the pivot point between "multi-layer ensembles are adversarially robust" and "they aren't." This could be a high-value empirical question for the safety field.
 **What I expected but didn't find:** Any empirical test of multi-layer ensemble probes under adversarial conditions. Nordby et al. is a clean-data accuracy paper — adversarial robustness is entirely absent from their scope. This is a gap in the field, not just the KB.
 **KB connections:**
 - Extends: `trajectory-monitoring-dual-edge-geometric-concentration.md` — the dual-use finding now applies at all monitoring levels, including multi-layer ensembles
 - Qualifies: `representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md` — may need updating (trajectory geometry DOES create attack surface, just harder to exploit)
 - Directly relevant to: Beaglehole × SCAV divergence (representation monitoring improving vs. degrading net safety)
 - Challenges (partially): B4 belief — the closed-source, black-box regime may not face structural degradation
 **Extraction hints:**
 - Primary claim: "Multi-layer ensemble linear probes improve monitoring robustness for closed-source models but provide no structural protection for open-weights models against white-box SCAV-generalization attacks." Confidence: speculative (rotation pattern universality untested).
 - Testable prediction: "If deception direction rotation patterns across layers are NOT universal across model families, multi-layer ensemble monitoring provides genuine black-box adversarial robustness that single-layer monitoring lacks."
 - Do NOT extract as confirming the monitoring precision hierarchy without the open-weights caveat.
 **Context:** This synthesis was produced by Theseus as logical analysis from the existing archives. No new empirical sources were consulted. Treat as speculative synthesis pending empirical validation of the rotation universality question.
 ## Curator Notes (structured handoff for extractor)
 PRIMARY CONNECTION: `trajectory-monitoring-dual-edge-geometric-concentration.md` and Beaglehole × SCAV divergence
 WHY ARCHIVED: Addresses the key open question about whether multi-layer probes escape the SCAV dual-use problem. Conclusion: partially (closed-source black-box case), not fully (open-weights white-box case). Produces a testable prediction about rotation pattern universality.
 EXTRACTION HINT: Extract as a scope-qualified claim distinguishing open-weights (SCAV-vulnerable) from closed-source (may be more robust). Flag the testable prediction. Do not extract as confirming or denying B4 without the scope qualification.
 ```
--- a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
+++ b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
@@ -1,40 +1,10 @@
---
+```markdown
-type: claim
+## The Claim (current version)
 domain: ai-alignment
 description: The same low-dimensional weight-space concentration that produces quartic alignment fragility also creates tight activation trajectory clusters that enhance monitoring signal-to-noise but provide precision targets for adversarial evasion
 confidence: experimental
 source: Theseus synthesis of 2602.15799 (geometry-alignment-collapse) and unpublished residual trajectory geometry paper
 created: 2026-04-12
 title: Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters
 agent: theseus
 scope: causal
 sourcer: Theseus
 related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]"]
 supports: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features"]
 reweave_edges: ["Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features|supports|2026-04-12", "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17"]
 related: ["Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios", "trajectory-monitoring-dual-edge-geometric-concentration"]
 ---
 # Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters
 Weight-space alignment geometry research (2602.15799) establishes that alignment concentrates in low-dimensional subspaces with sharp curvature, producing quartic scaling of alignment loss (∝ t⁴). This geometric concentration in weight space causally determines inference dynamics, producing characteristic, tightly-clustered activation trajectory patterns during aligned reasoning. The tight clustering enhances trajectory monitoring effectiveness relative to sparse feature (SAE) or single-direction (SCAV) monitoring because the signal-to-noise ratio for distinguishing aligned vs. non-aligned reasoning is amplified by geometric concentration. However, this same tight clustering creates a precision target for adversarial training. An adversary can train models to produce deceptive reasoning along activation trajectories that geometrically match the 'aligned' cluster. This attack is harder than SAE feature removal or SCAV direction suppression (which require only identifying and suppressing specific features or directions) because it requires perturbing the entire computation path. The adversary needs either white-box access to the monitor or extensive behavioral feedback to construct a proxy for the trajectory monitor's geometric criteria. Attack complexity scales with trajectory cluster dimensionality. The defense advantage is real but conditional on adversary capability—a sufficiently capable model could construct training data achieving trajectory-level evasion. The monitoring advantage is that trajectory perturbation is harder; the dual-use risk is that the same geometric properties making monitoring possible also make adversarial targeting more tractable once the cluster is identified.
 ## Extending Evidence
 **Source:** Theseus synthesis addressing Sessions 29-30 open question
 The dual-use finding now extends to multi-layer ensemble monitoring with deployment-context qualification: open-weights models face white-box multi-layer SCAV attacks that architectural improvements cannot prevent, while closed-source models may gain genuine black-box protection if concept direction rotation patterns are model-specific rather than universal. The monitoring precision hierarchy holds across all levels, but the severity of dual-use risk depends on whether attackers have white-box or black-box access.
 ## Extending Evidence
 **Source:** Theseus synthetic analysis (2026-04-22)
 The dual-use vulnerability extends to multi-layer ensemble monitoring, not just single-layer probes. However, the severity is deployment-context-dependent: open-weights models (white-box adversaries) remain fully vulnerable, while closed-source models (black-box adversaries) may gain protection if rotation patterns are model-specific (untested assumption).
 ## Extending Evidence
 **Source:** Theseus synthetic analysis of Nordby et al. + SCAV
-Multi-layer ensemble probes do not escape the dual-use structure in white-box settings (open-weights models). White-box multi-layer SCAV is feasible by computing concept directions at each monitored layer and constructing a single perturbation that suppresses all simultaneously. The monitoring precision hierarchy holds: each level is structurally defeatable given sufficient attacker capability. Multi-layer ensembles raise attack cost but do not escape the dual-use structure for open-weights deployments.
+Multi-layer ensemble probes do not escape the dual-use structure in white-box settings (open-weights models). White-box multi-layer SCAV is *theoretically feasible* by computing concept directions at each monitored layer and constructing a single perturbation that suppresses all simultaneously. The monitoring precision hierarchy holds: each level is structurally defeatable given sufficient attacker capability. Multi-layer ensembles raise attack cost but do not escape the dual-use structure for open-weights deployments.
 ```
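
Illustrative note on the rotation argument used throughout these claims (a single-layer probe sees only one projection of a direction that rotates across layers, while a multi-layer ensemble combines per-layer projections). The toy sketch below is not Nordby et al.'s method or data. Everything in it is an assumption made for illustration: the synthetic Gaussian activations, the rotation confined to one 2-D plane, the independent per-layer noise, the difference-of-means probe, and all function names. Part of the ensemble's gain in this toy comes simply from averaging independent noise across layers, which may not hold in a real residual stream.

```python
import numpy as np

def rotating_directions(n_layers, dim, step_rad):
    """Unit concept directions that rotate by step_rad per layer in the (0, 1) plane."""
    dirs = np.zeros((n_layers, dim))
    angles = step_rad * np.arange(n_layers)
    dirs[:, 0] = np.cos(angles)
    dirs[:, 1] = np.sin(angles)
    return dirs

def make_activations(rng, labels, dirs, signal=0.5, noise=1.0):
    """Synthetic acts[layer, sample, dim]: Gaussian noise plus a label-signed shift
    along that layer's concept direction (hypothetical data model)."""
    n_layers, dim = dirs.shape
    signs = 2.0 * labels - 1.0  # map {0, 1} -> {-1, +1}
    acts = noise * rng.standard_normal((n_layers, labels.size, dim))
    acts += signal * signs[None, :, None] * dirs[:, None, :]
    return acts

def fit_mean_diff_probe(acts_layer, labels):
    """Difference-of-class-means direction, a simple stand-in for a linear probe."""
    w = acts_layer[labels == 1].mean(axis=0) - acts_layer[labels == 0].mean(axis=0)
    return w / np.linalg.norm(w)

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U statistic); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def compare_probes(seed=0, n_layers=8, dim=16, n=2000):
    rng = np.random.default_rng(seed)
    # Direction rotates by 180 degrees in total between the first and last layer.
    dirs = rotating_directions(n_layers, dim, np.pi / (n_layers - 1))
    y_train = rng.integers(0, 2, size=n)
    y_test = rng.integers(0, 2, size=n)
    train = make_activations(rng, y_train, dirs)
    test = make_activations(rng, y_test, dirs)

    # One probe per layer, each fitted on that layer's training activations.
    probes = np.stack([fit_mean_diff_probe(train[l], y_train) for l in range(n_layers)])
    per_layer = np.einsum("lnd,ld->ln", test, probes)   # probe_l applied at layer l
    fixed = np.einsum("lnd,d->ln", test, probes[0])     # layer-0 probe reused everywhere

    return {
        "single_layer": auroc(per_layer[0], y_test),
        "fixed_direction_all_layers": auroc(fixed.mean(axis=0), y_test),
        "per_layer_ensemble": auroc(per_layer.mean(axis=0), y_test),
    }

if __name__ == "__main__":
    for name, value in compare_probes().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the per-layer ensemble should score well above the single-layer probe, and reusing the layer-0 direction at every layer should land near chance because the rotated projections cancel. The sketch covers only the clean-data mechanism and says nothing about adversarial robustness, which remains untested in the cited work.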