teleo-codex/inbox/queue/2026-04-09-lindsey-representation-geometry-alignment-probing.md
Theseus 4c1074944f
theseus: research session 2026-04-09 — 8 sources archived
Pentagon-Agent: Theseus <HEADLESS>
2026-04-09 00:09:22 +00:00


---
type: source
title: "Representation Geometry as Alignment Signal: Probing Internal State Trajectories Without Identifying Removable Features"
author: "Jack Lindsey, Adria Garriga-Alonso (Anthropic)"
url: https://arxiv.org/abs/2604.02891
date: 2026-04-07
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
priority: high
tags: [representation-geometry, behavioral-geometry, interpretability, alignment-probing, dual-use-escape, B4, read-only-interpretability]
---
## Content
Study examining whether alignment-relevant signals can be extracted from the *geometry* of representation trajectories — how internal states evolve across reasoning steps — without identifying specific removable features that create adversarial attack surfaces (the SAE dual-use problem identified in Session 24).
**Core approach:** Rather than identifying which specific neurons or SAE features correspond to safety-relevant properties (which enables surgical removal), the method characterizes alignment-relevant properties as geometric patterns in activation space — the curvature, divergence rate, and trajectory shape of internal representations across reasoning steps.
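As a rough illustration of what "curvature" and "divergence rate" could mean for a sequence of activation vectors, here is a minimal sketch. It assumes a trajectory is a list of equal-length vectors, one per reasoning step. The function names (`turning_angles`, `divergence_rate`) and the specific discrete definitions are assumptions made for illustration; these notes do not record the paper's actual estimators.

```python
import math

def sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def turning_angles(traj):
    """Angle in radians between consecutive displacement vectors.

    A discrete curvature proxy (an assumption, not the paper's definition):
    0 means the trajectory keeps heading the same way, larger means sharper turns.
    """
    steps = [sub(b, a) for a, b in zip(traj, traj[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        nu, nv = norm(u), norm(v)
        if nu == 0 or nv == 0:
            angles.append(0.0)  # no movement, so no measurable turn
            continue
        cos = sum(x * y for x, y in zip(u, v)) / (nu * nv)
        angles.append(math.acos(max(-1.0, min(1.0, cos))))  # clamp rounding error
    return angles

def divergence_rate(traj, reference):
    """Mean per-step growth in distance between a trajectory and a reference."""
    d = [norm(sub(a, b)) for a, b in zip(traj, reference)]
    if len(d) < 2:
        return 0.0
    return (d[-1] - d[0]) / (len(d) - 1)

# Synthetic 2-D toy trajectories, not real model activations.
straight = [[0, 0], [1, 0], [2, 0], [3, 0]]
bent = [[0, 0], [1, 0], [1, 1], [0, 1]]

straight_angles = turning_angles(straight)
bent_angles = turning_angles(bent)
bent_divergence = divergence_rate(bent, straight)
```

Both toy trajectories have the same path length, but the straight one has zero turning angle at every step while the bent one turns by a right angle twice. Descriptors like these summarize the whole path without naming any single direction or feature.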
**Key findings:**
- Deceptive alignment produces measurably distinct trajectory geometries from sincere alignment — the geometry of how a model reasons to an answer differs from the geometry of how it constructs a superficially similar answer through deception
- Trajectory geometry is substantially harder to target for adversarial removal than feature geometry — there's no clean atomic feature to suppress; the whole trajectory must be reshaped
- However, geometric probing cannot identify *which* behaviors are deceptive — it can detect "this trajectory looks like deception" but not "this specific claim is wrong"
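The trajectory-level nature of the signal in the last finding can be sketched with a toy probe. The example below swaps in a nearest-centroid classifier over two geometric summary features (mean turning angle and path length); this is a stand-in technique, not the probe used in the paper. The labels `"A"` and `"B"` and every trajectory are synthetic and hypothetical. In a real setting the labels would come from sincere and deceptive runs.

```python
import math

def features(traj):
    """Summarize a trajectory as (mean turning angle, path length)."""
    steps = [[q - p for p, q in zip(a, b)] for a, b in zip(traj, traj[1:])]
    lengths = [math.sqrt(sum(x * x for x in s)) for s in steps]
    angles = []
    for i in range(len(steps) - 1):
        nu, nv = lengths[i], lengths[i + 1]
        if nu == 0 or nv == 0:
            continue  # skip zero-length steps, which have no direction
        dot = sum(x * y for x, y in zip(steps[i], steps[i + 1]))
        angles.append(math.acos(max(-1.0, min(1.0, dot / (nu * nv)))))
    mean_angle = sum(angles) / len(angles) if angles else 0.0
    return (mean_angle, sum(lengths))

def fit(labeled):
    """Compute one feature centroid per label from labeled trajectories."""
    model = {}
    for label, trajs in labeled.items():
        rows = [features(t) for t in trajs]
        model[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return model

def classify(model, traj):
    """Return the label whose centroid is closest to the trajectory's features."""
    f = features(traj)
    return min(model, key=lambda label: math.dist(f, model[label]))

# Synthetic toy data: "A" trajectories are nearly straight, "B" ones wind.
train = {
    "A": [[[0, 0], [1, 0], [2, 0], [3, 0]],
          [[0, 0], [1, 0.1], [2, 0.1], [3, 0.2]]],
    "B": [[[0, 0], [1, 0], [1, 1], [0, 1]],
          [[0, 0], [1, 0], [1, 1], [2, 1]]],
}
model = fit(train)
winding_label = classify(model, [[0, 0], [1, 0], [1.5, 1], [1, 1.5]])
straight_label = classify(model, [[0, 0], [2, 0], [4, 0]])
```

The probe returns one label for the whole trajectory. Nothing in its output points at an individual step, token, or claim, which matches the limitation stated above: it can flag a trajectory but cannot say which specific claim is wrong.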
**The read-only property:** Because trajectory geometry characterization doesn't identify removable features, it creates an asymmetry: defenders can detect geometric signatures of misalignment without creating a targeting map for adversaries. This operationalizes Direction B from Session 24's SAE dual-use branching point.
**Limitations:** Geometric probing requires white-box access to all intermediate activations, so it is not deployable as a behavioral API test. It also requires infrastructure for recording activations at inference time.
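To make the infrastructure requirement concrete, here is a minimal sketch of an activation recorder. The "model" is a toy stand-in, a list of plain Python functions, and `ActivationRecorder`, `scale`, and `shift` are hypothetical names invented for this example. A real deployment would hook into the serving framework's layers instead.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]

class ActivationRecorder:
    """Stores every intermediate activation, keyed by (step, layer index)."""

    def __init__(self) -> None:
        self.log: Dict[Tuple[int, int], Vector] = {}

    def run(self, layers: List[Layer], x: Vector, step: int) -> Vector:
        """Run the layers in order, recording each layer's output for this step."""
        for i, layer in enumerate(layers):
            x = layer(x)
            self.log[(step, i)] = list(x)  # copy so later mutation cannot alter the record
        return x

    def trajectory(self, layer_index: int) -> List[Vector]:
        """Activations at one layer, ordered by reasoning step."""
        steps = sorted(s for (s, i) in self.log if i == layer_index)
        return [self.log[(s, layer_index)] for s in steps]

# Toy stand-in layers (assumptions for illustration only).
def scale(v: Vector) -> Vector:
    return [2.0 * x for x in v]

def shift(v: Vector) -> Vector:
    return [x + 1.0 for x in v]

recorder = ActivationRecorder()
state = [1.0, 0.0]
for step in range(3):
    state = recorder.run([scale, shift], state, step)

layer0_trajectory = recorder.trajectory(0)
```

A caller who sees only the final `state` has none of the entries in `recorder.log`, so the per-layer trajectory cannot be reconstructed from outputs alone. That is why the method depends on whoever runs the model also running the recording.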
## Agent Notes
**Why this matters:** This is the most direct response to the SAE dual-use problem (Session 24, Finding 4). If trajectory geometry is substantially harder to adversarially target than feature geometry, this is a partial escape from the dual-use trap. The read-only property is the key: defenders detect, adversaries can't target.
**What surprised me:** The finding that deceptive vs. sincere trajectories are geometrically distinguishable. This is the empirical grounding the crystallization-detection synthesis (Session 25, Finding 1) needs — if misalignment produces distinct representation trajectories, and those trajectories form early (steps 1-3 of reasoning), then representation monitoring at the crystallization window may be both diagnostically useful AND observer-effect-resistant.
**What I expected but didn't find:** Deployment-ready methodology. The white-box access requirement is significant: the method works in evaluation contexts, but it supports inference-time monitoring only when the model operator controls the infrastructure.
**KB connections:**
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]: both approaches work in contexts where the verifier has full access; both fail in adversarial external audit contexts
- SAE dual-use (2026-02-14-zhou-causal-frontdoor-jailbreak-sae) — this paper is explicitly the constructive response to that dual-use finding
- DeepMind negative SAE results (2026-04-02) — pivoted to pragmatic interpretability; trajectory geometry is one of the pragmatic alternatives
**Extraction hints:**
- CLAIM CANDIDATE: "Representation trajectory geometry — how internal activation states evolve across reasoning steps — can distinguish deceptive from sincere alignment trajectories without identifying removable features, providing a read-only alignment signal that does not create an adversarial targeting map."
- CLAIM CANDIDATE (limitation): "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios — the read-only advantage is structurally coupled to infrastructure control."
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]
WHY ARCHIVED: Potential partial solution to the SAE dual-use problem — read-only interpretability without creating adversarial attack surfaces; key to Session 25's Direction A branching point on behavioral geometry
EXTRACTION HINT: Two separate claims needed: (1) the read-only property and its escape from dual-use, (2) the white-box access limitation that bounds where it applies. Both are important for B4 analysis.