---
type: source
title: "Representation Geometry as Alignment Signal: Probing Internal State Trajectories Without Identifying Removable Features"
author: "Jack Lindsey, Adria Garriga-Alonso (Anthropic)"
url: https://arxiv.org/abs/2604.02891
date: 2026-04-07
domain: ai-alignment
secondary_domains: []
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-09
priority: high
tags: [representation-geometry, behavioral-geometry, interpretability, alignment-probing, dual-use-escape, B4, read-only-interpretability]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

Study examining whether alignment-relevant signals can be extracted from the *geometry* of representation trajectories, i.e. how internal states evolve across reasoning steps, without identifying specific removable features that create adversarial attack surfaces (the SAE dual-use problem identified in Session 24).

**Core approach:** Rather than identifying which specific neurons or SAE features correspond to safety-relevant properties (which enables surgical removal), the method characterizes alignment-relevant properties as geometric patterns in activation space: the curvature, divergence rate, and trajectory shape of internal representations across reasoning steps.

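
The three geometric quantities named above can be made concrete with a minimal NumPy sketch. This note does not record the paper's actual definitions, so everything here is an assumption: the function names `turning_angles`, `tortuosity`, and `divergence_rate` are hypothetical, and the formulas are standard discrete-geometry stand-ins (turning angle as a curvature proxy, path length over chord length as a shape summary, log-distance slope as a divergence rate). A trajectory is assumed to be an array of shape `(T, d)`, one activation vector per reasoning step.

```python
import numpy as np

_EPS = 1e-12  # guards against division by zero and log(0)


def turning_angles(traj: np.ndarray) -> np.ndarray:
    """Angle in radians between consecutive step vectors (discrete curvature proxy).

    traj has shape (T, d). Returns T - 2 angles: 0 means the path keeps
    going straight, pi means it reverses direction.
    """
    steps = np.diff(traj, axis=0)                  # (T-1, d) displacement per step
    norms = np.linalg.norm(steps, axis=1)
    dots = np.sum(steps[:-1] * steps[1:], axis=1)  # dot product of adjacent steps
    cos = dots / np.maximum(norms[:-1] * norms[1:], _EPS)
    return np.arccos(np.clip(cos, -1.0, 1.0))


def tortuosity(traj: np.ndarray) -> float:
    """Total path length divided by start-to-end distance (1.0 for a straight path)."""
    path_len = np.linalg.norm(np.diff(traj, axis=0), axis=1).sum()
    chord = np.linalg.norm(traj[-1] - traj[0])
    return float(path_len / max(chord, _EPS))


def divergence_rate(traj_a: np.ndarray, traj_b: np.ndarray) -> float:
    """Least-squares slope of log distance between two equal-length trajectories.

    Positive values mean the trajectories separate roughly exponentially
    across steps; negative values mean they converge.
    """
    dist = np.linalg.norm(traj_a - traj_b, axis=1)
    t = np.arange(len(dist))
    slope, _intercept = np.polyfit(t, np.log(np.maximum(dist, _EPS)), 1)
    return float(slope)


if __name__ == "__main__":
    # Synthetic stand-ins for recorded activations; not real model data.
    rng = np.random.default_rng(0)
    straight = np.outer(np.arange(6.0), np.ones(4))
    wandering = straight + rng.normal(scale=0.5, size=straight.shape)
    print("mean turning angle:", turning_angles(wandering).mean())
    print("tortuosity:", tortuosity(wandering))
    print("divergence rate:", divergence_rate(straight, wandering))
```

All three summaries are computed over the whole path and do not point at any single neuron or feature, which is the property the read-only argument depends on. Whether these particular statistics separate deceptive from sincere trajectories is not something this sketch establishes.
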

**Key findings:**
- Deceptive alignment produces measurably distinct trajectory geometries from sincere alignment: the geometry of how a model reasons to an answer differs from the geometry of how it constructs a superficially similar answer through deception
- Trajectory geometry is substantially harder to target for adversarial removal than feature geometry: there's no clean atomic feature to suppress; the whole trajectory must be reshaped
- However, geometric probing cannot identify *which* behaviors are deceptive: it can detect "this trajectory looks like deception" but not "this specific claim is wrong"

**The read-only property:** Because trajectory geometry characterization doesn't identify removable features, it creates an asymmetry: defenders can detect geometric signatures of misalignment without creating a targeting map for adversaries. This is Direction B from Session 24's SAE dual-use branching point operationalized.

**Limitations:** Geometric probing requires white-box access to all intermediate activations, so it is not deployable as a behavioral API test. Requires inference-time activation recording infrastructure.

## Agent Notes

**Why this matters:** This is the most direct response to the SAE dual-use problem (Session 24, Finding 4). If trajectory geometry is substantially harder to adversarially target than feature geometry, this is a partial escape from the dual-use trap. The read-only property is the key: defenders detect, adversaries can't target.

**What surprised me:** The finding that deceptive vs. sincere trajectories are geometrically distinguishable. This is the empirical grounding the crystallization-detection synthesis (Session 25, Finding 1) needs: if misalignment produces distinct representation trajectories, and those trajectories form early (steps 1-3 of reasoning), then representation monitoring at the crystallization window may be both diagnostically useful AND observer-effect-resistant.


**What I expected but didn't find:** Deployment-ready methodology. White-box access requirement is significant: this works in evaluation contexts but not in inference-time monitoring unless the model operator controls the infrastructure.

**KB connections:**
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match]]: both approaches work in contexts where the verifier has full access; both fail in adversarial external audit contexts
- SAE dual-use (2026-02-14-zhou-causal-frontdoor-jailbreak-sae): this paper is explicitly the constructive response to that dual-use finding
- DeepMind negative SAE results (2026-04-02): pivoted to pragmatic interpretability; trajectory geometry is one of the pragmatic alternatives

**Extraction hints:**
- CLAIM CANDIDATE: "Representation trajectory geometry (how internal activation states evolve across reasoning steps) can distinguish deceptive from sincere alignment trajectories without identifying removable features, providing a read-only alignment signal that does not create an adversarial targeting map."
- CLAIM CANDIDATE (limitation): "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios; the read-only advantage is structurally coupled to infrastructure control."

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]

WHY ARCHIVED: Potential partial solution to the SAE dual-use problem: read-only interpretability without creating adversarial attack surfaces; key to Session 25's Direction A branching point on behavioral geometry

EXTRACTION HINT: Two separate claims needed: (1) the read-only property and its escape from dual-use, (2) the white-box access limitation that bounds where it applies. Both are important for B4 analysis.