teleo-codex/inbox/archive/ai-alignment/2026-03-21-arxiv-noise-injection-degrades-safety-guardrails.md
2026-04-14 17:43:33 +00:00

---
type: source
title: "Noise Injection Systemically Degrades Large Language Model Safety Guardrails"
author: "Unknown authors"
url: https://arxiv.org/abs/2505.13500
date: 2025-05-01
domain: ai-alignment
secondary_domains: []
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-14
priority: high
tags: [noise-injection, safety-guardrails, evaluation-methodology, safety-tradeoffs, alignment-failure]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
Demonstrates that injecting Gaussian noise into model activations degrades safety guardrails, raising harmful output rates by up to 27% (p < 0.001) across multiple open-weight models. Safety fine-tuning provides minimal robustness against perturbations, and deeper safety training offers no additional protection. Chain-of-thought reasoning remains largely intact despite the safety degradation, suggesting the vulnerability is specific to safety mechanisms. The authors propose reasoning-based and reinforcement learning approaches as more robust alternatives for future safety alignment work.
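
To make the mechanism concrete, the sketch below shows what adding Gaussian noise to activations (rather than to weights) looks like on a toy two-layer network in plain Python. It is not the paper's code: the network, the function names `add_gaussian_noise`, `toy_forward`, and `mean_output_shift`, and the sigma values are all hypothetical, and the output shift it reports is only a stand-in for the harmful-output rates the paper measures on real models.

```python
import math
import random


def add_gaussian_noise(activations, sigma, rng):
    """Return a copy of `activations` with i.i.d. N(0, sigma^2) noise added."""
    return [a + rng.gauss(0.0, sigma) for a in activations]


def toy_forward(x, w_hidden, w_out, sigma=0.0, rng=None):
    """Toy two-layer network (hypothetical stand-in for an LLM layer stack).

    Noise is injected into the hidden activations; the weights are untouched.
    """
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    if sigma > 0.0:
        rng = rng or random.Random()
        hidden = add_gaussian_noise(hidden, sigma, rng)
    return sum(w * h for w, h in zip(w_out, hidden))


def mean_output_shift(inputs, w_hidden, w_out, sigma, seed=0):
    """Average absolute change in output caused by noise of scale `sigma`."""
    rng = random.Random(seed)
    shifts = []
    for x in inputs:
        clean = toy_forward(x, w_hidden, w_out)
        noisy = toy_forward(x, w_hidden, w_out, sigma=sigma, rng=rng)
        shifts.append(abs(noisy - clean))
    return sum(shifts) / len(shifts)


if __name__ == "__main__":
    # Arbitrary toy weights and inputs; none of these values come from the paper.
    init = random.Random(42)
    w_hidden = [[init.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
    w_out = [init.uniform(-1, 1) for _ in range(8)]
    inputs = [[init.uniform(-1, 1) for _ in range(4)] for _ in range(100)]
    for sigma in (0.0, 0.1, 0.5):
        print(sigma, mean_output_shift(inputs, w_hidden, w_out, sigma))
```

In a real evaluation the same perturbation would presumably be applied to a model's intermediate activations while harmful-output rates are scored on a prompt set; how the paper chooses layers and noise scales is not recorded in this note.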
## Agent Notes
**Why this matters:** This is the critical counter-finding to arXiv:2412.01784 (noise injection for sandbagging detection). The same technique that can detect sandbagging can also reliably undermine safety guardrails. This creates a double-bind for regulatory use: using noise injection in capability evaluations risks exposing dangerous degradation pathways. Any governance framework that adopts noise injection must manage this safety trade-off carefully.
**What surprised me:** The specificity of the degradation: safety mechanisms fail but chain-of-thought reasoning remains intact. This suggests safety guardrails are more fragile than general reasoning capabilities, which is alarming for the general architecture of safety-through-RLHF approaches.
**What I expected but didn't find:** Specific mitigation strategies for how to use noise injection for detection while preventing the safety guardrail degradation. The paper only recommends against noise injection; it doesn't propose a safe variant.
**KB connections:** Directly challenges "scalable oversight degrades rapidly as capability gaps grow": the methods proposed to improve oversight may themselves degrade safety. Also connects to "RLHF and DPO both fail at preference diversity": the fragility of safety fine-tuning under noise suggests the RLHF safety layer is shallower than generally assumed.
**Extraction hints:** Claim candidate: "Noise injection into model activations undermines safety guardrails at rates comparable to its utility for sandbagging detection, creating a methodological double-bind for regulatory evaluation frameworks that require capability assessment without capability enabling." Confidence: experimental (single paper, open-weight models only, not yet replicated on frontier closed models). This is a complication to the noise injection detection approach, not a refutation.
**Context:** This paper directly contradicts the optimistic framing of noise injection as a straightforward regulatory tool. The extractor should note both papers together; they form a tension that neither resolves.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow (this paper complicates the proposed technical solution)
WHY ARCHIVED: Critical counter to the leading sandbagging detection method. Creates a methodological double-bind that the previous session didn't identify.
EXTRACTION HINT: Archive alongside arXiv:2412.01784. The tension between these two papers is the core finding. The extractor should look for a divergence between them or a scoped claim that captures the double-bind.