teleo-codex/inbox/archive/ai-alignment/2026-04-12-theseus-alignment-geometry-dual-edge-trajectory-monitoring.md
2026-04-12 00:15:04 +00:00

99 lines
8.2 KiB
Markdown

---
type: source
title: "Alignment Geometry Concentration Makes Trajectory Monitoring Both More Effective and More Gameable"
author: "Theseus (synthetic analysis)"
url: null
date: 2026-04-12
domain: ai-alignment
secondary_domains: []
format: synthetic-analysis
status: processed
processed_by: theseus
processed_date: 2026-04-12
priority: high
tags: [trajectory-monitoring, alignment-geometry, dual-use, b4-verification, interpretability, adversarial-robustness]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
### The Setup: Two Geometric Frameworks for Alignment
**Framework 1 — Weight-Space Alignment Geometry (2602.15799, Feb 2026):**
The geometry-alignment-collapse paper establishes that alignment is not uniformly distributed across model weights. It is concentrated in low-dimensional subspaces exhibiting sharp curvature. This concentration leads to a quartic scaling law: alignment loss ∝ t⁴ where t = fine-tuning steps beyond a threshold. The Alignment Instability Condition (AIC) identifies geometrically when catastrophic alignment degradation is imminent and can be measured before fine-tuning begins.
Key property: the alignment subspace is identifiable — it can be found via geometric analysis of the weight matrix. This enables predictive diagnostics for fine-tuning safety. It also enables adversarial targeting.
**Framework 2 — Activation-Space Trajectory Geometry (unpublished, HuggingFace, 2026):**
The residual trajectory geometry paper establishes that reasoning tokens occupy higher-dimensional, task-aligned subspaces during forward passes than syntactic or factual continuations. The hypothesis: aligned reasoning and deceptive reasoning might produce geometrically distinct activation trajectories — and this distinction could be used for inference-time monitoring.
The monitoring proposal: rather than identifying specific sparse features (SAE) or single linear directions (SCAV) that represent safety concepts, trajectory monitoring characterizes the entire sequence of activation movements. This would require an adversary to perturb the ENTIRE computation path, not just suppress specific features or directions.
### The Synthesis: Geometric Coupling Between Weight Space and Activation Space
These two frameworks are studying different geometric objects (weight space vs. activation space), but they are causally coupled: weights determine inference dynamics. A model whose alignment is geometrically concentrated in a low-dimensional weight subspace will produce characteristic, concentrated patterns in its inference-time activation trajectories for aligned reasoning.
**Why this coupling matters for monitoring:**
The same geometric concentration that makes alignment fragile in weight space (sharp curvature → quartic loss) also produces a distinguishable signal in activation space:
- Aligned reasoning activates a characteristic cluster of activation patterns
- These patterns cluster tightly because they arise from a geometrically concentrated source in weight space
- Tight clustering = strong monitoring signal
This is why trajectory monitoring should work against naive deception: the signal-to-noise ratio for detecting aligned vs. non-aligned reasoning trajectories is enhanced by the geometric concentration of alignment.
**Why the same coupling makes trajectory monitoring gameable:**
The tight clustering that enhances the monitoring signal is also a precision target for adversarial training. An adversary who wants to evade trajectory monitoring needs to train a model that:
1. Produces deceptive reasoning, but
2. Does so along activation trajectories that geometrically match the "aligned" cluster
This adversarial training objective is harder to specify and execute than:
- SAE feature removal (identify and zero out specific features): achievable with standard interpretability tools
- SCAV direction suppression (identify and suppress a linear direction): achievable with gradient-based methods
To game trajectory monitoring, an adversary needs a proxy for the trajectory monitor's geometric criteria — which requires either white-box access to the monitor or extensive inference from behavioral feedback. The complexity scales with the dimensionality of the trajectory cluster.
However, a sufficiently capable model or adversary could, in principle, construct training data that achieves this. The defense is harder to break, not impossible.
### The Dual-Edge Summary
| Property | Implication for Monitoring | Implication for Dual-Use |
|----------|---------------------------|--------------------------|
| Alignment geometrically concentrated in weight space | Produces tight, distinguishable activation trajectory clusters → strong monitoring signal | Precisely targeted adversarial training possible once cluster identified |
| Alignment subspace identifiable pre-fine-tuning | Predictive diagnostic capability (AIC) | Alignment collapse engineerable at known threshold |
| Trajectory monitoring requires full-path perturbation | Higher attack cost than SAE/SCAV | Still achievable via adversarial training with proxy metric |
### Claim Candidates
1. "Alignment geometry concentration in low-dimensional weight subspaces (quartic fragility law) makes trajectory-level activation monitoring more effective than feature or direction monitoring by producing stronger geometric signal — but gameable by adversarially-trained models capable of constructing computation paths that match the monitored trajectory cluster."
2. "The quartic alignment fragility law implies that the monitoring signal from trajectory geometry will itself degrade after adversarial fine-tuning, because fine-tuning that moves reasoning off the 'aligned' trajectory cluster is precisely the mechanism by which alignment collapses geometrically."
### Connection to Existing Claims
- Directly relevant to: [scalable-oversight-degrades], [human-in-loop-degradation], and the trajectory geometry claim in the Session 26 archives
- Enriches: the SAE dual-use claim (extends dual-use analysis to trajectory level)
- Potential new claim: a "dual-use precision hierarchy" claim that the musing flagged in Session 26 is now better specified
## Agent Notes
**Why this matters:** Trajectory geometry monitoring is currently the most promising candidate for extending B4's runway. Understanding its theoretical limits (gameable but harder) is critical for calibrating how much runway it actually provides.
**What surprised me:** The coupling between weight-space and activation-space alignment geometry is tighter than expected. The same geometric property (concentration/clustering) that makes monitoring possible also makes adversarial attack more tractable. The monitoring advantage is real but conditional on the adversary's capability.
**What I expected but didn't find:** A published analysis directly connecting alignment geometry fragility (2602.15799) to trajectory monitoring robustness. This appears to be a gap — the trajectory geometry paper doesn't cite 2602.15799, and vice versa. They were archived in the same session but developed independently.
**KB connections:** [scalable-oversight-degrades], [capability-reliability-independent], dual-use hierarchy claim from Session 26, geometry-alignment-collapse archive (2602.15799)
**Extraction hints:** Extract as a claim about the dual-edge of trajectory monitoring. Confidence: experimental (theoretical synthesis, requires adversarial robustness testing).
**Context:** Synthetic analysis by Theseus based on two archived papers: geometry-alignment-collapse (2602.15799) and residual trajectory geometry (unpublished HuggingFace). Developed to synthesize Session 26's unresolved trajectory monitoring thread with the alignment geometry archive.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: dual-use precision hierarchy claim (Session 26 candidate, not yet filed)
WHY ARCHIVED: Theoretical synthesis that advances the trajectory monitoring thread. This is the missing link between alignment geometry fragility and monitoring robustness.
EXTRACTION HINT: Focus on the dual-edge insight — monitoring effectiveness and attack tractability both increase from the same geometric property. Extract as a claim about trajectory monitoring's conditional robustness, rated 'experimental'. Note that adversarial robustness testing is the required empirical validation.