| claim |
ai-alignment |
SafeThink demonstrates that monitoring reasoning traces and injecting corrective prefixes during early steps reduces jailbreak success by 30-60% while preserving reasoning performance, establishing early crystallization as a tractable continuous alignment mechanism |
experimental |
Ghosal et al., SafeThink paper - tested across 6 models and 4 jailbreak benchmarks |
2026-04-08 |
Inference-time safety monitoring can recover alignment without retraining because safety decisions crystallize in the first 1-3 reasoning steps creating an exploitable intervention window |
theseus |
causal |
Ghosal et al. |
|
| Inference-time compute creates non-monotonic safety scaling where extended chain-of-thought reasoning initially improves then degrades alignment as models reason around safety constraints |
| Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks |
|
| Inference-time compute creates non-monotonic safety scaling where extended chain-of-thought reasoning initially improves then degrades alignment as models reason around safety constraints|related|2026-04-09 |
| Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17 |
|