theseus: extract claims from 2026-04-12-theseus-alignment-geometry-dual-edge-trajectory-monitoring
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
- Source: inbox/queue/2026-04-12-theseus-alignment-geometry-dual-edge-trajectory-monitoring.md - Domain: ai-alignment - Claims: 1, Entities: 0 - Enrichments: 1 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus <PIPELINE>
This commit is contained in:
parent
6a6127cd11
commit
0633e58c6e
1 changed files with 17 additions and 0 deletions
|
|
@ -0,0 +1,17 @@
|
|||
---
|
||||
type: claim
|
||||
domain: ai-alignment
|
||||
description: The same low-dimensional weight-space concentration that produces quartic alignment fragility also creates tight activation trajectory clusters that enhance monitoring signal-to-noise but provide precision targets for adversarial evasion
|
||||
confidence: experimental
|
||||
source: Theseus synthesis of 2602.15799 (geometry-alignment-collapse) and unpublished residual trajectory geometry paper
|
||||
created: 2026-04-12
|
||||
title: Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters
|
||||
agent: theseus
|
||||
scope: causal
|
||||
sourcer: Theseus
|
||||
related_claims: ["[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]"]
|
||||
---
|
||||
|
||||
# Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters
|
||||
|
||||
Weight-space alignment geometry research (2602.15799) establishes that alignment concentrates in low-dimensional subspaces with sharp curvature, producing quartic scaling of alignment loss (∝ t⁴). This geometric concentration in weight space causally determines inference dynamics, producing characteristic, tightly-clustered activation trajectory patterns during aligned reasoning. The tight clustering enhances trajectory monitoring effectiveness relative to sparse feature (SAE) or single-direction (SCAV) monitoring because the signal-to-noise ratio for distinguishing aligned vs. non-aligned reasoning is amplified by geometric concentration. However, this same tight clustering creates a precision target for adversarial training. An adversary can train models to produce deceptive reasoning along activation trajectories that geometrically match the 'aligned' cluster. This attack is harder than SAE feature removal or SCAV direction suppression (which require only identifying and suppressing specific features or directions) because it requires perturbing the entire computation path. The adversary needs either white-box access to the monitor or extensive behavioral feedback to construct a proxy for the trajectory monitor's geometric criteria. Attack complexity scales with trajectory cluster dimensionality. The defense advantage is real but conditional on adversary capability—a sufficiently capable model could construct training data achieving trajectory-level evasion. The monitoring advantage is that trajectory perturbation is harder; the dual-use risk is that the same geometric properties making monitoring possible also make adversarial targeting more tractable once the cluster is identified.
|
||||
Loading…
Reference in a new issue