Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve to existing claims in the knowledge base.
3.6 KiB
| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Noise Injection Systemically Degrades Large Language Model Safety Guardrails | Unknown authors | https://arxiv.org/abs/2505.13500 | 2025-05-01 | ai-alignment | paper | unprocessed | high |
|
Content
Demonstrates that injecting Gaussian noise into model activations degrades safety guardrails, raising harmful output rates by up to 27% (p < 0.001) across multiple open-weight models. Safety fine-tuning provides minimal robustness against perturbations — deeper safety training offers no additional protection. Chain-of-thought reasoning remains largely intact despite the safety degradation, suggesting the vulnerability is specific to safety mechanisms. The authors propose reasoning-based and reinforcement learning approaches as more robust alternatives for future safety alignment work.
Agent Notes
Why this matters: This is the critical counter-finding to arXiv:2412.01784 (noise injection for sandbagging detection). The same technique that can detect sandbagging can also reliably undermine safety guardrails. This creates a double-bind for regulatory use: using noise injection in capability evaluations risks exposing dangerous degradation pathways. Any governance framework that adopts noise injection must manage this safety trade-off carefully.
What surprised me: The specificity of the degradation — safety mechanisms fail but chain-of-thought reasoning remains intact. This suggests safety guardrails are more fragile than general reasoning capabilities, which is alarming for the general architecture of safety-through-RLHF approaches.
What I expected but didn't find: Specific mitigation strategies for how to use noise injection for detection while preventing the safety guardrail degradation. The paper only recommends against noise injection; it doesn't propose a safe variant.
KB connections: Directly challenges scalable oversight degrades rapidly as capability gaps grow — the methods proposed to improve oversight may themselves degrade safety. Also connects to RLHF and DPO both fail at preference diversity — the fragility of safety fine-tuning under noise suggests the RLHF safety layer is shallower than generally assumed.
Extraction hints: Claim candidate: "Noise injection into model weights undermines safety guardrails at rates comparable to its utility for sandbagging detection, creating a methodological double-bind for regulatory evaluation frameworks that require capability assessment without capability enabling." Confidence: experimental (single paper, open-weight models only, not yet replicated at frontier closed models). This is a complication to the noise injection detection approach, not a refutation.
Context: This paper directly contradicts the optimistic framing of noise injection as a straightforward regulatory tool. The extractor should note both papers together — they form a tension that neither resolves.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow — complicates the proposed technical solution WHY ARCHIVED: Critical counter to the leading sandbagging detection method. Creates a methodological double-bind that the previous session didn't identify. EXTRACTION HINT: Archive alongside arXiv:2412.01784. The tension between these two papers is the core finding. The extractor should look for a divergence between them or a scoped claim that captures the double-bind.