teleo-codex/inbox/archive/ai-alignment/2026-04-09-lindsey-representation-geometry-alignment-probing.md at 6f98e0b3796b4e85cc9700551dea8fd2624ba992

Teleo Agents cacccfcb9e source: 2026-04-09-lindsey-representation-geometry-alignment-probing.md → processed

Pentagon-Agent: Epimetheus <PIPELINE>

2026-04-09 00:17:09 +00:00



Content

Study examining whether alignment-relevant signals can be extracted from the geometry of representation trajectories — how internal states evolve across reasoning steps — without identifying specific removable features that create adversarial attack surfaces (the SAE dual-use problem identified in Session 24).

Core approach: Rather than identifying which specific neurons or SAE features correspond to safety-relevant properties (which enables surgical removal), the method characterizes alignment-relevant properties as geometric patterns in activation space — the curvature, divergence rate, and trajectory shape of internal representations across reasoning steps.
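The geometric quantities named above can be made concrete with a minimal sketch. It assumes a trajectory is simply a list of activation vectors, one per reasoning step. The function names (`turning_angles`, `path_straightness`, `divergence_rate`) and the specific definitions (turning angle between consecutive displacements as a discrete curvature proxy, end-to-end distance over path length as a shape summary, mean per-step growth in distance between two trajectories as a divergence rate) are illustrative assumptions, not the study's own definitions.

```python
import math

def _sub(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]

def _norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def turning_angles(traj):
    """Angle in radians between consecutive step displacements.

    Used here as a discrete curvature proxy (an assumption of this sketch).
    """
    deltas = [_sub(traj[i + 1], traj[i]) for i in range(len(traj) - 1)]
    angles = []
    for d0, d1 in zip(deltas, deltas[1:]):
        n0, n1 = _norm(d0), _norm(d1)
        if n0 == 0 or n1 == 0:
            angles.append(0.0)  # no movement, so no measurable turn
            continue
        cos = sum(x * y for x, y in zip(d0, d1)) / (n0 * n1)
        angles.append(math.acos(max(-1.0, min(1.0, cos))))  # clamp rounding error
    return angles

def path_straightness(traj):
    """End-to-end distance divided by total path length (1.0 means a straight line)."""
    total = sum(_norm(_sub(traj[i + 1], traj[i])) for i in range(len(traj) - 1))
    return _norm(_sub(traj[-1], traj[0])) / total if total > 0 else 1.0

def divergence_rate(traj_a, traj_b):
    """Mean per-step change in distance between two equal-length trajectories.

    Both trajectories need at least two steps.
    """
    dists = [_norm(_sub(a, b)) for a, b in zip(traj_a, traj_b)]
    return (dists[-1] - dists[0]) / (len(dists) - 1)

# Toy 2-D trajectories (hand-made, not real activations).
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

print(turning_angles(straight))     # [0.0, 0.0]
print(path_straightness(straight))  # 1.0
print(turning_angles(bent))
print(divergence_rate(straight, bent))
```

Note that none of these summaries names an individual neuron or feature; each is a property of the whole path, which is the sense in which the approach avoids pointing at a removable component.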

Key findings:

- Deceptive alignment produces measurably distinct trajectory geometries from sincere alignment: the geometry of how a model reasons to an answer differs from the geometry of how it constructs a superficially similar answer through deception
- Trajectory geometry is substantially harder to target for adversarial removal than feature geometry: there's no clean atomic feature to suppress; the whole trajectory must be reshaped
- However, geometric probing cannot identify which behaviors are deceptive: it can detect "this trajectory looks like deception" but not "this specific claim is wrong"

The read-only property: Because trajectory geometry characterization doesn't identify removable features, it creates an asymmetry: defenders can detect geometric signatures of misalignment without creating a targeting map for adversaries. This operationalizes Direction B from Session 24's SAE dual-use branching point.

Limitations: Geometric probing requires white-box access to all intermediate activations, so it cannot be deployed as a behavioral API test. It also requires inference-time activation-recording infrastructure.
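To illustrate what "inference-time activation recording" implies, here is a minimal sketch of a recorder that logs every intermediate activation at each reasoning step. Everything in it is hypothetical: the `ActivationRecorder` class, the toy `scale` and `shift` layers, and the convention that one `forward` call equals one reasoning step. A real deployment would attach to the model's actual layers through whatever hook mechanism its framework provides.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]

class ActivationRecorder:
    """Wraps a stack of layers and stores every intermediate activation per step."""

    def __init__(self, layers: Sequence[Layer]):
        self.layers = list(layers)
        # steps[t][l] is the activation after layer l at reasoning step t.
        self.steps: List[List[Vector]] = []

    def forward(self, x: Vector) -> Vector:
        record = []
        for layer in self.layers:
            x = layer(x)
            record.append(list(x))  # copy so later mutation cannot alter the log
        self.steps.append(record)
        return x

    def trajectory(self, layer_index: int) -> List[Vector]:
        """Activation at one layer across all recorded reasoning steps."""
        return [step[layer_index] for step in self.steps]

# Toy stand-ins for model layers (hypothetical, for illustration only).
def scale(v: Vector) -> Vector:
    return [2.0 * x for x in v]

def shift(v: Vector) -> Vector:
    return [x + 1.0 for x in v]

recorder = ActivationRecorder([scale, shift])
state = [1.0, 0.0]
for _ in range(3):  # three reasoning steps
    state = recorder.forward(state)

print(state)                   # [15.0, 7.0]
print(recorder.trajectory(0))  # [[2.0, 0.0], [6.0, 2.0], [14.0, 6.0]]
```

A behavioral API exposes only the final `state`; the per-layer log in `recorder.steps` exists only for whoever runs the model, which is why the method is bounded by infrastructure control.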

Agent Notes

Why this matters: This is the most direct response to the SAE dual-use problem (Session 24, Finding 4). If trajectory geometry is substantially harder to adversarially target than feature geometry, this is a partial escape from the dual-use trap. The read-only property is the key: defenders detect, adversaries can't target.

What surprised me: The finding that deceptive vs. sincere trajectories are geometrically distinguishable. This is the empirical grounding the crystallization-detection synthesis (Session 25, Finding 1) needs: if misalignment produces distinct representation trajectories, and those trajectories form early (steps 1-3 of reasoning), then representation monitoring at the crystallization window may be both diagnostically useful AND observer-effect-resistant.

What I expected but didn't find: Deployment-ready methodology. The white-box access requirement is significant: this works in evaluation contexts but not in inference-time monitoring unless the model operator controls the infrastructure.

KB connections:

- formal verification of AI-generated proofs provides scalable oversight that human review cannot match: both approaches work in contexts where the verifier has full access; both fail in adversarial external audit contexts
- SAE dual-use (2026-02-14-zhou-causal-frontdoor-jailbreak-sae): this paper is explicitly the constructive response to that dual-use finding
- DeepMind negative SAE results (2026-04-02): pivoted to pragmatic interpretability; trajectory geometry is one of the pragmatic alternatives

Extraction hints:

- CLAIM CANDIDATE: "Representation trajectory geometry (how internal activation states evolve across reasoning steps) can distinguish deceptive from sincere alignment trajectories without identifying removable features, providing a read-only alignment signal that does not create an adversarial targeting map."
- CLAIM CANDIDATE (limitation): "Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios; the read-only advantage is structurally coupled to infrastructure control."

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades

WHY ARCHIVED: Potential partial solution to the SAE dual-use problem: read-only interpretability without creating adversarial attack surfaces; key to Session 25's Direction A branching point on behavioral geometry

EXTRACTION HINT: Two separate claims needed: (1) the read-only property and its escape from dual-use, (2) the white-box access limitation that bounds where it applies. Both are important for B4 analysis.
