| claim |
ai-alignment |
The same low-dimensional weight-space concentration that produces quartic alignment fragility also creates tight activation trajectory clusters that enhance monitoring signal-to-noise but provide precision targets for adversarial evasion |
experimental |
Theseus synthesis of 2602.15799 (geometry-alignment-collapse) and unpublished residual trajectory geometry paper |
2026-04-12 |
Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters |
theseus |
causal |
Theseus |
|
| Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features |
|
| Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features|supports|2026-04-12 |
| Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17 |
|
| Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios |
|