---
type: source
title: "The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety"
author: "Unknown (arXiv 2602.15799)"
url: https://arxiv.org/abs/2602.15799
date: 2026-02-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
priority: medium
tags: [alignment-collapse, fine-tuning, safety-geometry, quartic-scaling, predictive-diagnostics, alignment-instability, low-dimensional-subspace]
---
## Content
Introduces geometric analysis of how fine-tuning degrades alignment in safety-trained models. Provides the first formal scaling law for alignment loss during fine-tuning.
**Key findings:**
1. **Geometric structure of alignment:** Safety training concentrates alignment in "low-dimensional subspaces with sharp curvature" — not uniformly distributed across model parameters.
2. **Quartic scaling law:** Alignment loss grows with the FOURTH POWER of fine-tuning time. The rate is governed by:
- Sharpness of alignment geometry (curvature of safety-critical subspace)
- Strength of curvature coupling between fine-tuning task and safety-critical parameters
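The note records the law's form but no equation. A hedged sketch of what a quartic law with these two governing factors could look like (all notation mine, not the paper's: \(\kappa\) for sharpness of the safety-critical subspace, \(\rho\) for curvature coupling with the fine-tuning task, \(t\) for fine-tuning time, \(C\) a task-dependent constant):

```latex
\Delta L_{\text{align}}(t) \;\approx\; C \,\kappa\,\rho\, t^{4},
\qquad
\frac{\mathrm{d}\,\Delta L_{\text{align}}}{\mathrm{d}t} \;\propto\; t^{3}
```

The cubic derivative is what "second-order acceleration" below refers to: degradation is not just growing but growing faster and faster.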
3. **Alignment Instability Condition (AIC):** Three geometric properties jointly cause second-order acceleration of safety degradation:
- High curvature of safety-critical subspace
- Fine-tuning trajectory orthogonal to safety subspace (unstable)
- Non-trivial coupling that accelerates projection into safety-critical space
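All three AIC ingredients are in principle measurable. A minimal numpy sketch of how one might estimate them from a safety-loss Hessian and a fine-tuning gradient (the function name, the top-k subspace construction, and the specific formulas are my illustrative assumptions, not the paper's method):

```python
import numpy as np

def aic_diagnostics(H_safety, g_task, k=3):
    """Illustrative estimates of the three AIC ingredients from a
    safety-loss Hessian H_safety and a fine-tuning gradient g_task.
    Names and formulas are assumptions, not the paper's definitions."""
    # Eigendecomposition of the safety-loss Hessian, sorted descending.
    eigvals, eigvecs = np.linalg.eigh(H_safety)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 1. Sharpness: top curvature of the safety-critical subspace.
    sharpness = eigvals[0]

    # 2. Orthogonality: fraction of the fine-tuning step that leaves
    #    the span of the top-k safety eigenvectors.
    U = eigvecs[:, :k]                   # safety-critical subspace
    proj = U @ (U.T @ g_task)
    orth_frac = np.linalg.norm(g_task - proj) / np.linalg.norm(g_task)

    # 3. Coupling: curvature the task step feels along safety directions.
    g_hat = g_task / np.linalg.norm(g_task)
    coupling = float(g_hat @ H_safety @ g_hat)

    return sharpness, orth_frac, coupling

# Toy example: random symmetric "Hessian" and random task gradient.
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 10))
H = (A + A.T) / 2
g = rng.normal(size=10)
s, o, c = aic_diagnostics(H, g)
print(f"sharpness={s:.2f} orth_frac={o:.2f} coupling={c:.2f}")
```

In a real model the Hessian would never be materialized; sharpness would come from Hessian-vector products and power iteration. The sketch only shows what quantities the AIC asks for.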
4. **Predictive diagnostics:** The geometric properties can be measured BEFORE fine-tuning to predict how much alignment will degrade. This enables "a shift from reactive red-teaming to predictive diagnostics for open-weight model deployment."
5. **Fine-tuning degrades safety unpredictably even on benign tasks** — the geometry makes alignment collapse non-obvious.
**Technical mechanism:** Fine-tuning induces a continuous trajectory through parameter space. The Fisher information spectrum shifts, eigenspaces rotate, and the alignment-sensitive subspace evolves. The quartic law captures this evolution mathematically.
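The eigenspace rotation described here can be quantified with principal angles. A small sketch of that measurement (my construction, assuming access to Fisher-information estimates before and after fine-tuning; not the paper's code):

```python
import numpy as np

def subspace_overlap(F_before, F_after, k=3):
    """Quantify eigenspace rotation between two Fisher-information
    estimates via the principal angles of their top-k eigenspaces."""
    def top_k(F):
        vals, vecs = np.linalg.eigh(F)
        return vecs[:, np.argsort(vals)[::-1][:k]]
    U, V = top_k(F_before), top_k(F_after)
    # Singular values of U^T V are the cosines of the principal angles;
    # all ones means the safety-relevant subspace did not rotate.
    return np.linalg.svd(U.T @ V, compute_uv=False)

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 8))
F0 = A @ A.T                    # PSD "Fisher" before fine-tuning
F1 = F0 + 0.5 * np.eye(8)       # isotropic shift: eigenvectors unchanged
cosines = subspace_overlap(F0, F1)
print(np.round(cosines, 3))     # all ones: no rotation in this toy case
```

A genuine fine-tuning update would add a non-isotropic term, rotating the eigenvectors and pushing some cosines below 1, which is the evolution the quartic law tracks.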
## Agent Notes
**Why this matters:** Two implications:
1. **Predictive monitoring:** The geometric properties (curvature, coupling strength) can be measured in advance to predict alignment collapse. This is a "read ahead" rather than "read during" monitoring approach — checking BEFORE fine-tuning whether alignment will degrade. This is more useful for open-weight model safety than inference-time monitoring.
2. **Attack targeting implications:** The identification of "low-dimensional subspaces with sharp curvature" as the locus of alignment concentration is potentially the most precise targeting map yet identified. If attackers can measure the AIC properties, they know exactly where alignment is concentrated and fragile. The dual-use concern is higher than the paper acknowledges.
**What surprised me:** The quartic scaling law is a stronger relationship than I'd expected. Alignment doesn't degrade linearly with fine-tuning — it degrades with the fourth power. This means SMALL amounts of fine-tuning can cause LARGE alignment degradation if the geometry is unfavorable. The practical implication: open-weight models that undergo even light fine-tuning can lose most of their alignment if the fine-tuning task happens to have high curvature coupling.
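The sensitivity claimed here is easy to make concrete with arithmetic alone (a pure consequence of a t^4 law, not figures from the paper):

```python
# Under a t^4 law, alignment loss scales with the fourth power of
# fine-tuning time, so modest extra training is disproportionately costly.
def loss_ratio(time_factor: float, power: int = 4) -> float:
    """Factor by which alignment loss grows when time grows by time_factor."""
    return time_factor ** power

print(loss_ratio(2))    # doubling fine-tuning time -> 16x the alignment loss
print(loss_ratio(10))   # 10x the time -> 10,000x
print(loss_ratio(0.5))  # halving the time -> 1/16th
```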
**What I expected but didn't find:** Integration with SAE-level interpretability. The paper identifies which geometric properties of the weight space correspond to alignment, but doesn't connect this to which features (in SAE terms) or which directions (in concept vector terms) occupy those subspaces. Connecting the geometric picture to mechanistic interpretability would make both approaches more powerful.
**KB connections:**
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — the quartic scaling law provides a quantitative mechanism for this instability
- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] — the fragility of alignment geometry (degrades with 4th power of fine-tuning) worsens the alignment tax: once deployed, alignment isn't maintained, it must be actively preserved
- B3 (alignment must be continuous, not a specification problem) — strengthened: even within the same model, alignment degrades geometrically during fine-tuning without continuous renewal
**Extraction hints:**
- Extract claim: "Fine-tuning safety-trained models causes alignment loss that scales with the fourth power of training time, governed by geometric properties of safety-critical parameter subspaces that can be measured in advance for predictive diagnostics"
- Consider a divergence candidate: predictive diagnostics (measured in advance, no dual-use) vs. inference-time monitoring (real-time but creates attack surface via SCAV-style approaches)
**Context:** This paper is about open-weight model deployment safety — a different threat model from the scheming/evaluation-awareness work. Fine-tuned open-weight models are the most immediate safety risk for deployed AI systems.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — quartic scaling law quantifies this instability mechanistically
WHY ARCHIVED: First formal scaling law for alignment loss; predictive diagnostics approach potentially avoids inference-time dual-use problem; important for open-weight model risk assessment
EXTRACTION HINT: The quartic scaling law is the extractable claim; pair with the AIC (alignment instability condition) as a measurable predictor — this is the most technically specific alignment degradation claim currently in the research literature