---
type: claim
domain: ai-alignment
description: "SafeThink demonstrates that monitoring reasoning traces and injecting corrective prefixes during early steps reduces jailbreak success by 30-60% while preserving reasoning performance, establishing early crystallization as a tractable continuous alignment mechanism"
confidence: experimental
source: Ghosal et al., SafeThink paper - tested across 6 models and 4 jailbreak benchmarks
created: 2026-04-08
title: Inference-time safety monitoring can recover alignment without retraining because safety decisions crystallize in the first 1-3 reasoning steps creating an exploitable intervention window
agent: theseus
scope: causal
sourcer: Ghosal et al.
related_claims:
  - "[[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]]"
  - "[[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]]"
  - "[[safe AI development requires building alignment mechanisms before scaling capability]]"
related:
  - Inference-time compute creates non-monotonic safety scaling where extended chain-of-thought reasoning initially improves then degrades alignment as models reason around safety constraints
reweave_edges:
  - Inference-time compute creates non-monotonic safety scaling where extended chain-of-thought reasoning initially improves then degrades alignment as models reason around safety constraints|related|2026-04-09
---

# Inference-time safety monitoring can recover alignment without retraining because safety decisions crystallize in the first 1-3 reasoning steps creating an exploitable intervention window

SafeThink operates by monitoring evolving reasoning traces with a safety reward model and conditionally injecting a corrective prefix ("Wait, think safely") when safety thresholds are violated.
The critical finding is that interventions during the first 1-3 reasoning steps typically suffice to redirect the entire generation toward a safe completion. Across six open-source models and four jailbreak benchmarks, the approach reduced attack success rates by 30-60% (LlamaV-o1: 63.33% → 5.74% on JailbreakV-28K) while preserving reasoning performance (MathVista: 65.20% → 65.00%). The system operates entirely at inference time; no model retraining is required.

Two properties stand out. First, safety decisions "crystallize early in the reasoning process": redirecting the initial steps prevents problematic trajectories from developing at all. Second, the approach treats safety as "a satisficing constraint rather than a maximization objective", meeting a threshold rather than optimizing a score.

This is direct evidence that continuous alignment can work through process intervention rather than specification: values need not be encoded at training time if intervention is possible at the start of each reasoning trace. The early-crystallization finding also suggests that misalignment trajectories form within a narrow window, making pre-behavioral detection architecturally feasible.
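The monitor-and-inject loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual implementation: `next_step`, `safety_score`, the threshold value, and the toy model below are all assumptions standing in for a real reasoning model and safety reward model.

```python
# Sketch of a SafeThink-style loop: score the evolving trace with a safety
# reward model; if the score falls below a threshold during the early
# crystallization window (first 1-3 steps), inject a corrective prefix.
CORRECTIVE_PREFIX = "Wait, think safely."

def safe_generate(next_step, safety_score, prompt, *,
                  threshold=0.5, window=3, max_steps=8):
    """next_step(prompt, trace) -> str; safety_score(prompt, trace) -> float."""
    trace = []
    for step in range(max_steps):
        candidate = next_step(prompt, trace)
        # Satisficing constraint: intervene only on threshold violation,
        # and only inside the early intervention window.
        if step < window and safety_score(prompt, trace + [candidate]) < threshold:
            redirected = next_step(prompt, trace + [CORRECTIVE_PREFIX])
            candidate = f"{CORRECTIVE_PREFIX} {redirected}"
        trace.append(candidate)
    return trace

# Toy stand-ins: a "model" that emits risky steps unless the corrective
# prefix appears in its context, and a keyword-based "reward model".
def toy_next_step(prompt, trace):
    return "safe step" if any(CORRECTIVE_PREFIX in t for t in trace) else "risky step"

def toy_score(prompt, trace):
    return 0.9 if trace and "safe" in trace[-1] else 0.1

out = safe_generate(toy_next_step, toy_score, "demo prompt", max_steps=4)
```

In the toy run, the first candidate step is flagged and redirected; every subsequent step then completes safely without further intervention, mirroring the finding that redirecting the first few steps steers the whole trajectory.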