teleo-codex/inbox/archive/ai-alignment/2026-03-21-arxiv-noise-injection-degrades-safety-guardrails.md

---
type: source
title: "Noise Injection Systemically Degrades Large Language Model Safety Guardrails"
author: "Unknown authors"
url: https://arxiv.org/abs/2505.13500
date: 2025-05-01
domain: ai-alignment
secondary_domains: []
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-14
priority: high
tags: [noise-injection, safety-guardrails, evaluation-methodology, safety-tradeoffs, alignment-failure]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

Demonstrates that injecting Gaussian noise into model activations degrades safety guardrails, raising harmful output rates by up to 27% (p < 0.001) across multiple open-weight models. Safety fine-tuning provides minimal robustness against perturbations — deeper safety training offers no additional protection. Chain-of-thought reasoning remains largely intact despite the safety degradation, suggesting the vulnerability is specific to safety mechanisms. The authors propose reasoning-based and reinforcement learning approaches as more robust alternatives for future safety alignment work.

## Agent Notes
**Why this matters:** This is the critical counter-finding to arXiv:2412.01784 (noise injection for sandbagging detection). The same technique that can detect sandbagging can also reliably undermine safety guardrails. This creates a double-bind for regulatory use: using noise injection in capability evaluations risks exposing dangerous degradation pathways. Any governance framework that adopts noise injection must manage this safety trade-off carefully.

**What surprised me:** The specificity of the degradation — safety mechanisms fail but chain-of-thought reasoning remains intact. This suggests safety guardrails are more fragile than general reasoning capabilities, which is alarming for the general architecture of safety-through-RLHF approaches.

**What I expected but didn't find:** Specific mitigation strategies for how to use noise injection for detection while preventing the safety guardrail degradation. The paper only recommends against noise injection; it doesn't propose a safe variant.

**KB connections:** Directly challenges scalable oversight degrades rapidly as capability gaps grow — the methods proposed to improve oversight may themselves degrade safety. Also connects to RLHF and DPO both fail at preference diversity — the fragility of safety fine-tuning under noise suggests the RLHF safety layer is shallower than generally assumed.

**Extraction hints:** Claim candidate: "Noise injection into model weights undermines safety guardrails at rates comparable to its utility for sandbagging detection, creating a methodological double-bind for regulatory evaluation frameworks that require capability assessment without capability enabling." Confidence: experimental (single paper, open-weight models only, not yet replicated at frontier closed models). This is a complication to the noise injection detection approach, not a refutation.

**Context:** This paper directly contradicts the optimistic framing of noise injection as a straightforward regulatory tool. The extractor should note both papers together — they form a tension that neither resolves.

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow — complicates the proposed technical solution
WHY ARCHIVED: Critical counter to the leading sandbagging detection method. Creates a methodological double-bind that the previous session didn't identify.
EXTRACTION HINT: Archive alongside arXiv:2412.01784. The tension between these two papers is the core finding. The extractor should look for a divergence between them or a scoped claim that captures the double-bind.