teleo-codex/inbox/archive/ai-alignment/2026-04-09-lindsey-representation-geometry-alignment-probing.md

type: source
title: "Representation Geometry as Alignment Signal: Probing Internal State Trajectories Without Identifying Removable Features"
author: "Jack Lindsey, Adria Garriga-Alonso (Anthropic)"
url: https://arxiv.org/abs/2604.02891
date: 2026-04-07
domain: ai-alignment
secondary_domains: []
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-09
priority: high
tags: [representation-geometry, behavioral-geometry, interpretability, alignment-probing, dual-use-escape, B4, read-only-interpretability]
extraction_model: anthropic/claude-sonnet-4.5

Content

Study examining whether alignment-relevant signals can be extracted from the geometry of representation trajectories — how internal states evolve across reasoning steps — without identifying specific removable features that create adversarial attack surfaces (the SAE dual-use problem identified in Session 24).

Core approach: Rather than identifying which specific neurons or SAE features correspond to safety-relevant properties (which enables surgical removal), the method characterizes alignment-relevant properties as geometric patterns in activation space — the curvature, divergence rate, and trajectory shape of internal representations across reasoning steps.
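The geometric quantities named above can be made concrete with a small sketch. The source does not say how curvature, divergence rate, or trajectory shape are computed, so the definitions here are assumptions: curvature is approximated by the turning angle between consecutive step-to-step displacements, divergence rate by the average per-step change in distance from a reference trajectory, and shape by total path length. The function names are hypothetical, and a trajectory is taken to be a list of activation vectors, one per reasoning step.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _sub(a: Vector, b: Vector) -> List[float]:
    return [x - y for x, y in zip(a, b)]


def _norm(a: Vector) -> float:
    return math.sqrt(sum(x * x for x in a))


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def turning_angles(trajectory: Sequence[Vector]) -> List[float]:
    """Angle in radians between consecutive displacements.

    Assumed stand-in for curvature: 0 means the trajectory keeps
    moving in the same direction, larger values mean sharper turns.
    """
    deltas = [_sub(b, a) for a, b in zip(trajectory, trajectory[1:])]
    angles = []
    for d0, d1 in zip(deltas, deltas[1:]):
        denom = _norm(d0) * _norm(d1)
        if denom == 0.0:
            # A zero-length step has no direction; treat it as no turn.
            angles.append(0.0)
            continue
        cos = max(-1.0, min(1.0, _dot(d0, d1) / denom))
        angles.append(math.acos(cos))
    return angles


def divergence_rate(trajectory: Sequence[Vector],
                    reference: Sequence[Vector]) -> float:
    """Average per-step change in distance from a reference trajectory."""
    dists = [_norm(_sub(a, b)) for a, b in zip(trajectory, reference)]
    if len(dists) < 2:
        return 0.0
    return (dists[-1] - dists[0]) / (len(dists) - 1)


def path_length(trajectory: Sequence[Vector]) -> float:
    """Total distance travelled in activation space (a crude shape summary)."""
    return sum(_norm(_sub(b, a)) for a, b in zip(trajectory, trajectory[1:]))
```

Note that each function returns a summary of the whole trajectory rather than a score for any single coordinate, which is the sense in which this kind of description does not point at an individual feature.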

Key findings:

  • Deceptive alignment produces measurably distinct trajectory geometries from sincere alignment — the geometry of how a model reasons to an answer differs from the geometry of how it constructs a superficially similar answer through deception
  • Trajectory geometry is substantially harder to target for adversarial removal than feature geometry — there's no clean atomic feature to suppress; the whole trajectory must be reshaped
  • However, geometric probing cannot localize the deception: it can flag "this trajectory looks like deception" but not "this specific claim is wrong"
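As an illustration of what a trajectory-level probe might look like, here is a minimal sketch. It assumes each trajectory has already been reduced to a short descriptor vector (for example a mean turning angle and a divergence rate; both choices are assumptions). It uses a nearest-centroid classifier, which is a stand-in and not the paper's method. The numbers in the example are invented for illustration and are not results from the source.

```python
import math
from typing import Dict, List, Sequence

Descriptor = Sequence[float]


def centroid(rows: Sequence[Descriptor]) -> List[float]:
    """Coordinate-wise mean of a group of descriptor vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_centroids(descriptors: Sequence[Descriptor],
                  labels: Sequence[str]) -> Dict[str, List[float]]:
    """Compute one centroid per label from labeled trajectory descriptors."""
    groups: Dict[str, List[Descriptor]] = {}
    for desc, label in zip(descriptors, labels):
        groups.setdefault(label, []).append(desc)
    return {label: centroid(rows) for label, rows in groups.items()}


def classify(descriptor: Descriptor,
             centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is closest in Euclidean distance.

    The output is a single label for the whole trajectory; nothing here
    attributes the decision to a particular claim or reasoning step.
    """
    return min(centroids, key=lambda lbl: math.dist(descriptor, centroids[lbl]))


# Made-up descriptors: [mean turning angle, divergence rate].
train = [[0.1, 0.0], [0.2, 0.1], [0.9, 0.8], [1.0, 0.7]]
train_labels = ["sincere", "sincere", "deceptive", "deceptive"]
probe = fit_centroids(train, train_labels)
verdict = classify([0.8, 0.9], probe)  # one label per trajectory
```

The probe returns one verdict per trajectory, which mirrors the third finding: a flag on the trajectory as a whole, with no indication of which claim inside it is wrong.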

The read-only property: Because trajectory geometry characterization doesn't identify removable features, it creates an asymmetry: defenders can detect geometric signatures of misalignment without creating a targeting map for adversaries. This operationalizes Direction B from Session 24's SAE dual-use branching point.

Limitations: Geometric probing requires white-box access to all intermediate activations, so it is not deployable as a behavioral API test. It also requires inference-time activation recording infrastructure.
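To show what the recording requirement implies in practice, here is a minimal sketch of an activation recorder. Everything in it is hypothetical: `step_fn` stands in for one reasoning step of a model whose internal state the operator can read, and the state is simplified to a flat list of floats. A black-box API that returns only the final output would not supply the per-step states this loop depends on.

```python
from typing import Callable, List, Sequence, Tuple

State = List[float]


class ActivationRecorder:
    """Stores a copy of the internal state after every reasoning step."""

    def __init__(self) -> None:
        self.trajectory: List[State] = []

    def record(self, activations: Sequence[float]) -> None:
        # Copy so later in-place changes to the state do not alter history.
        self.trajectory.append(list(activations))


def run_with_recording(step_fn: Callable[[State], State],
                       initial_state: Sequence[float],
                       n_steps: int) -> Tuple[State, List[State]]:
    """Run n_steps of a hypothetical step function, recording every state.

    Returns the final state and the full trajectory, including the
    initial state, so the trajectory has n_steps + 1 entries.
    """
    recorder = ActivationRecorder()
    state = list(initial_state)
    recorder.record(state)
    for _ in range(n_steps):
        state = step_fn(state)
        recorder.record(state)
    return state, recorder.trajectory
```

The recorded trajectory is the input any geometric analysis would need, which is why the approach fits settings where the evaluator controls the inference stack.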

Agent Notes

Why this matters: This is the most direct response to the SAE dual-use problem (Session 24, Finding 4). If trajectory geometry is substantially harder to adversarially target than feature geometry, this is a partial escape from the dual-use trap. The read-only property is the key: defenders detect, adversaries can't target.

What surprised me: The finding that deceptive vs. sincere trajectories are geometrically distinguishable. This is the empirical grounding the crystallization-detection synthesis (Session 25, Finding 1) needs: if misalignment produces distinct representation trajectories, and those trajectories form early (steps 1-3 of reasoning), then representation monitoring at the crystallization window may be both diagnostically useful AND observer-effect-resistant.

What I expected but didn't find: Deployment-ready methodology. White-box access requirement is significant: this works in evaluation contexts but not in inference-time monitoring unless the model operator controls the infrastructure.

KB connections:

  • formal verification of AI-generated proofs provides scalable oversight that human review cannot match — both approaches work in contexts where the verifier has full access; both fail in adversarial external audit contexts
  • SAE dual-use (2026-02-14-zhou-causal-frontdoor-jailbreak-sae) — this paper is explicitly the constructive response to that dual-use finding
  • DeepMind negative SAE results (2026-04-02): pivoted to pragmatic interpretability; trajectory geometry is one of the pragmatic alternatives

Extraction hints:

  • CLAIM CANDIDATE: "Representation trajectory geometry — how internal activation states evolve across reasoning steps — can distinguish deceptive from sincere alignment trajectories without identifying removable features, providing a read-only alignment signal that does not create an adversarial targeting map."
  • CLAIM CANDIDATE (limitation): "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios — the read-only advantage is structurally coupled to infrastructure control."

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades

WHY ARCHIVED: Potential partial solution to the SAE dual-use problem: read-only interpretability without creating adversarial attack surfaces; key to Session 25's Direction A branching point on behavioral geometry

EXTRACTION HINT: Two separate claims needed: (1) the read-only property and its escape from dual-use, (2) the white-box access limitation that bounds where it applies. Both are important for B4 analysis.