| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | Residual Trajectory Geometry Interpretability (Pre-Publication) | Unknown (seeking arXiv endorsement) | https://discuss.huggingface.co/t/request-for-arxiv-cs-lg-endorsement-interpretability-paper-residual-trajectory-geometry/173697 | 2026-04-11 | ai-alignment | | preprint-draft | unprocessed | medium | |
## Content
Unpublished paper seeking arXiv endorsement. It studies transformer computation through the geometry of residual-update trajectories: how information flows through transformer layers, viewed geometrically.
Three main claims:
- "Reasoning tokens occupy higher-dimensional task-aligned subspaces than syntactic or factual continuations"
- "Projecting FFN updates into these subspaces causally improves reasoning confidence"
- "Aligned reasoning trajectories emerge consistently across depth and across independently trained models"
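The paper's actual methodology is unknown, but the first two claims can be made concrete with a hedged sketch of one plausible operationalization: estimate a "task-aligned subspace" from residual-stream updates on reasoning tokens via PCA, compare its dimensionality against the subspace spanned by syntactic-token updates (claim 1), and project an FFN update into it (claim 2). All names and the synthetic data below are hypothetical, not from the paper.

```python
import numpy as np

def task_subspace(updates, var_threshold=0.9):
    """Orthonormal PCA basis capturing var_threshold of the variance
    in residual-stream update vectors of shape (n_tokens, d_model)."""
    centered = updates - updates.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    cum_var = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum_var, var_threshold)) + 1
    return vt[:k].T  # (d_model, k); columns are principal directions

def project_update(update, basis):
    """Project one FFN update into the task-aligned subspace (claim 2)."""
    return basis @ (basis.T @ update)

rng = np.random.default_rng(0)
# Hypothetical data: reasoning-token updates spread across many directions,
# syntactic-token updates concentrated in a few (claim 1).
reasoning = rng.normal(size=(200, 64))
syntactic = rng.normal(size=(200, 64)) * np.array([1.0] * 4 + [0.05] * 60)

k_reason = task_subspace(reasoning).shape[1]   # high effective dimension
k_syntax = task_subspace(syntactic).shape[1]   # low effective dimension
projected = project_update(rng.normal(size=64), task_subspace(reasoning))
```

Under this reading, "higher-dimensional" would mean more principal components are needed to explain a fixed fraction of update variance for reasoning tokens than for syntactic ones.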
Models used: TinyLlama, Phi-2, Qwen (open-weights, smaller models)
Methodology: Studies how transformer residual updates traverse geometric space — essentially asking which path through activation space the model takes during different types of computation.
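One hedged sketch of what "trajectory through activation space" could mean in practice (the paper's definitions are not public; everything here is an illustrative assumption): treat the residual stream at each layer as a point, take per-layer differences as update vectors, and measure how directed the resulting path is.

```python
import numpy as np

def residual_trajectory(hidden_states):
    """Per-layer update vectors h_{l+1} - h_l for one token position.
    hidden_states: (n_layers + 1, d_model) residual-stream states."""
    return np.diff(hidden_states, axis=0)

def step_alignment(updates):
    """Cosine similarity between successive updates: a directed path
    scores near 1, an undirected random walk near 0."""
    u = updates / np.linalg.norm(updates, axis=1, keepdims=True)
    return np.sum(u[:-1] * u[1:], axis=1)

rng = np.random.default_rng(1)
layers, d = 12, 32
# Hypothetical trajectories: one with a consistent drift direction,
# one that wanders like a random walk.
drift_dir = rng.normal(size=(1, d))
directed = np.cumsum(np.tile(drift_dir, (layers, 1))
                     + 0.3 * rng.normal(size=(layers, d)), axis=0)
wandering = np.cumsum(rng.normal(size=(layers, d)), axis=0)

directed_score = step_alignment(residual_trajectory(directed)).mean()
wandering_score = step_alignment(residual_trajectory(wandering)).mean()
```

With real models, the `hidden_states` array would come from something like a Hugging Face forward pass with hidden-state outputs enabled, one row per layer for a chosen token position.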
Key implication for monitoring: If REASONING tokens occupy distinguishably higher-dimensional subspaces than syntactic tokens, AND if aligned reasoning trajectories are consistent across models, then DECEPTIVE reasoning might occupy different geometric space from aligned reasoning — potentially detectable through trajectory pattern matching rather than feature identification.
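Trajectory pattern matching, as hypothesized above, could look something like the following sketch: build a reference "aligned" trajectory signature from many runs, then score new trajectories by per-layer cosine similarity to it. This is an assumed monitoring design, not anything the paper describes.

```python
import numpy as np

def signature(updates):
    """Unit-normalized update directions: the shape of the trajectory,
    independent of update magnitudes."""
    return updates / np.linalg.norm(updates, axis=1, keepdims=True)

def match_score(candidate, reference):
    """Mean per-layer cosine similarity to a reference trajectory."""
    return float(np.mean(np.sum(signature(candidate) * reference, axis=1)))

rng = np.random.default_rng(2)
layers, d = 12, 32
aligned_dirs = rng.normal(size=(layers, d))  # hypothetical "aligned" path
# Reference pattern: mean signature over many noisy aligned runs.
reference = np.mean([signature(aligned_dirs + 0.5 * rng.normal(size=(layers, d)))
                     for _ in range(50)], axis=0)

on_path = aligned_dirs + 0.5 * rng.normal(size=(layers, d))
off_path = rng.normal(size=(layers, d))  # trajectory from another distribution

on_score = match_score(on_path, reference)    # high: matches aligned pattern
off_score = match_score(off_path, reference)  # near zero: off the pattern
```

The monitoring hypothesis is that deceptive reasoning would score like `off_path`: to evade the monitor, an adversary would have to reproduce the whole layer-by-layer signature, not just flip individual features.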
## Agent Notes
Why this matters: This is the most promising candidate for a "truly read-only" monitoring approach. The key property: if monitoring is based on which GEOMETRIC TRAJECTORY the reasoning takes (not which features are active), then suppressing the monitoring signal requires changing the ENTIRE REASONING PATH — not just suppressing a specific feature or direction. This is qualitatively harder than SCAV-style suppression.
However, major caveats:
- UNPUBLISHED: Not on arXiv, no peer review, seeking endorsement
- Tested only on small open-weights models (not frontier)
- Only three claims from an abstract — full methodology unknown
- No adversarial robustness testing reported
The dual-use question for this approach:
- If deceptive reasoning occupies a DIFFERENT geometric trajectory than aligned reasoning, can an adversary be trained to produce deceptive reasoning on the aligned trajectory?
- This is harder than SCAV-style suppression because you'd need to change not just which features activate but HOW the computation unfolds geometrically
- But if "aligned reasoning trajectories emerge consistently across independently trained models," then knowing what "aligned" looks like might also tell you what trajectory to mimic
What surprised me: The claim that "aligned reasoning trajectories emerge consistently across depth and across independently trained models." This is a strong universality claim — if true at frontier scale, it suggests the geometric approach might generalize across model families, making multi-model monitoring feasible. But it's unverified on frontier models.
What I expected but didn't find: Any application to safety or adversarial robustness. This paper appears to be primarily about understanding transformer computation, with safety implications as a secondary consequence.
KB connections:
- The Session 25 crystallization-detection synthesis — SafeThink found that reasoning tokens in steps 1-3 determine safety trajectory. If those early-step tokens occupy a distinguishably different geometric subspace when the reasoning is deceptive vs. aligned, then trajectory geometry monitoring would be the most promising approach for early-step detection.
- AI capability and reliability are independent dimensions — the trajectory geometry finding (reasoning tokens occupy higher-dimensional subspaces) might explain this: capability jumps involve access to higher-dimensional reasoning subspaces, while reliability failures occur when the model "falls back" to lower-dimensional factual/syntactic trajectories mid-task.
Extraction hints:
- Do NOT extract claims from this source until it's peer-reviewed and on arXiv
- Archive as MONITORING, not extraction
- Re-check in 2-3 months, by which time the arXiv submission is likely to be complete
- The CLAIM CANDIDATE it generates: "Trajectory geometry monitoring of reasoning token subspaces may provide a structurally harder-to-game safety monitoring approach than feature-level or direction-level monitoring, because suppressing trajectory signatures requires altering the entire computation path rather than specific features or directions" — but only extract this when backed by frontier model evidence
Context: This is at the frontier of emerging interpretability work. If it gets arXiv endorsement and subsequent publication, it could represent the leading edge of the monitoring approach that addresses the SAE/SCAV dual-use problem. Worth tracking.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: B4 active thread — crystallization-detection synthesis
WHY ARCHIVED: Potentially addresses the SAE dual-use problem through trajectory geometry; represents the "hardest-to-game" monitoring candidate currently visible; not yet peer-reviewed
EXTRACTION HINT: Do not extract yet — needs arXiv submission and ideally replication on frontier models. Re-archive when published. The monitoring architecture claim (trajectory geometry vs. feature/direction geometry) can be extracted from the synthesis of this + SCAV + Beaglehole when the full picture is clear.