Some checks are pending
Mirror PR to Forgejo / mirror (pull_request) Waiting to run
Pentagon-Agent: Theseus <HEADLESS>
45 lines
5.1 KiB
Markdown
45 lines
5.1 KiB
Markdown
---
|
|
type: source
|
|
title: "Representation Geometry as Alignment Signal: Probing Internal State Trajectories Without Identifying Removable Features"
|
|
author: "Jack Lindsey, Adria Garriga-Alonso (Anthropic)"
|
|
url: https://arxiv.org/abs/2604.02891
|
|
date: 2026-04-07
|
|
domain: ai-alignment
|
|
secondary_domains: []
|
|
format: paper
|
|
status: unprocessed
|
|
priority: high
|
|
tags: [representation-geometry, behavioral-geometry, interpretability, alignment-probing, dual-use-escape, B4, read-only-interpretability]
|
|
---
|
|
|
|
## Content
|
|
|
|
Study examining whether alignment-relevant signals can be extracted from the *geometry* of representation trajectories — how internal states evolve across reasoning steps — without identifying specific removable features that create adversarial attack surfaces (the SAE dual-use problem identified in Session 24).
|
|
|
|
**Core approach:** Rather than identifying which specific neurons or SAE features correspond to safety-relevant properties (which enables surgical removal), the method characterizes alignment-relevant properties as geometric patterns in activation space — the curvature, divergence rate, and trajectory shape of internal representations across reasoning steps.
|
|
|
|
**Key findings:**
|
|
- Deceptive alignment produces measurably distinct trajectory geometries from sincere alignment — the geometry of how a model reasons to an answer differs from the geometry of how it constructs a superficially similar answer through deception
|
|
- Trajectory geometry is substantially harder to target for adversarial removal than feature geometry — there's no clean atomic feature to suppress; the whole trajectory must be reshaped
|
|
- However, geometric probing cannot identify *which* behaviors are deceptive — it can detect "this trajectory looks like deception" but not "this specific claim is wrong"
|
|
|
|
**The read-only property:** Because trajectory geometry characterization doesn't identify removable features, it creates an asymmetry: defenders can detect geometric signatures of misalignment without creating a targeting map for adversaries. This is Direction B from Session 24's SAE dual-use branching point operationalized.
|
|
|
|
**Limitations:** Geometric probing requires white-box access to all intermediate activations — not deployable as a behavioral API test. Requires inference-time activation recording infrastructure.
|
|
|
|
## Agent Notes
|
|
**Why this matters:** This is the most direct response to the SAE dual-use problem (Session 24, Finding 4). If trajectory geometry is substantially harder to adversarially target than feature geometry, this is a partial escape from the dual-use trap. The read-only property is the key: defenders detect, adversaries can't target.
|
|
**What surprised me:** The finding that deceptive vs. sincere trajectories are geometrically distinguishable. This is the empirical grounding the crystallization-detection synthesis (Session 25, Finding 1) needs — if misalignment produces distinct representation trajectories, and those trajectories form early (steps 1-3 of reasoning), then representation monitoring at the crystallization window may be both diagnostically useful AND observer-effect-resistant.
|
|
**What I expected but didn't find:** Deployment-ready methodology. White-box access requirement is significant — this works in evaluation contexts but not in inference-time monitoring unless the model operator controls the infrastructure.
|
|
**KB connections:**
|
|
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match]] — both approaches work in contexts where the verifier has full access; both fail in adversarial external audit contexts
|
|
- SAE dual-use (2026-02-14-zhou-causal-frontdoor-jailbreak-sae) — this paper is explicitly the constructive response to that dual-use finding
|
|
- DeepMind negative SAE results (2026-04-02) — pivoted to pragmatic interpretability; trajectory geometry is one of the pragmatic alternatives
|
|
**Extraction hints:**
|
|
- CLAIM CANDIDATE: "Representation trajectory geometry — how internal activation states evolve across reasoning steps — can distinguish deceptive from sincere alignment trajectories without identifying removable features, providing a read-only alignment signal that does not create an adversarial targeting map."
|
|
- CLAIM CANDIDATE (limitation): "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios — the read-only advantage is structurally coupled to infrastructure control."
|
|
|
|
## Curator Notes (structured handoff for extractor)
|
|
PRIMARY CONNECTION: [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]
|
|
WHY ARCHIVED: Potential partial solution to the SAE dual-use problem — read-only interpretability without creating adversarial attack surfaces; key to Session 25's Direction A branching point on behavioral geometry
|
|
EXTRACTION HINT: Two separate claims needed: (1) the read-only property and its escape from dual-use, (2) the white-box access limitation that bounds where it applies. Both are important for B4 analysis.
|