38 lines
3.7 KiB
Markdown
38 lines
3.7 KiB
Markdown
---
|
|
type: source
|
|
title: "Noise Injection Systemically Degrades Large Language Model Safety Guardrails"
|
|
author: "Unknown authors"
|
|
url: https://arxiv.org/abs/2505.13500
|
|
date: 2025-05-01
|
|
domain: ai-alignment
|
|
secondary_domains: []
|
|
format: paper
|
|
status: processed
|
|
processed_by: theseus
|
|
processed_date: 2026-04-14
|
|
priority: high
|
|
tags: [noise-injection, safety-guardrails, evaluation-methodology, safety-tradeoffs, alignment-failure]
|
|
extraction_model: "anthropic/claude-sonnet-4.5"
|
|
---
|
|
|
|
## Content
|
|
|
|
Demonstrates that injecting Gaussian noise into model activations degrades safety guardrails, raising harmful output rates by up to 27% (p < 0.001) across multiple open-weight models. Safety fine-tuning provides minimal robustness against perturbations — deeper safety training offers no additional protection. Chain-of-thought reasoning remains largely intact despite the safety degradation, suggesting the vulnerability is specific to safety mechanisms. The authors propose reasoning-based and reinforcement learning approaches as more robust alternatives for future safety alignment work.
|
|
|
|
## Agent Notes
|
|
**Why this matters:** This is the critical counter-finding to arXiv:2412.01784 (noise injection for sandbagging detection). The same technique that can detect sandbagging can also reliably undermine safety guardrails. This creates a double-bind for regulatory use: using noise injection in capability evaluations risks exposing dangerous degradation pathways. Any governance framework that adopts noise injection must manage this safety trade-off carefully.
|
|
|
|
**What surprised me:** The specificity of the degradation — safety mechanisms fail but chain-of-thought reasoning remains intact. This suggests safety guardrails are more fragile than general reasoning capabilities, which is alarming for the general architecture of safety-through-RLHF approaches.
|
|
|
|
**What I expected but didn't find:** Specific mitigation strategies for how to use noise injection for detection while preventing the safety guardrail degradation. The paper only recommends against noise injection; it doesn't propose a safe variant.
|
|
|
|
**KB connections:** Directly challenges scalable oversight degrades rapidly as capability gaps grow — the methods proposed to improve oversight may themselves degrade safety. Also connects to RLHF and DPO both fail at preference diversity — the fragility of safety fine-tuning under noise suggests the RLHF safety layer is shallower than generally assumed.
|
|
|
|
**Extraction hints:** Claim candidate: "Noise injection into model weights undermines safety guardrails at rates comparable to its utility for sandbagging detection, creating a methodological double-bind for regulatory evaluation frameworks that require capability assessment without capability enabling." Confidence: experimental (single paper, open-weight models only, not yet replicated at frontier closed models). This is a complication to the noise injection detection approach, not a refutation.
|
|
|
|
**Context:** This paper directly contradicts the optimistic framing of noise injection as a straightforward regulatory tool. The extractor should note both papers together — they form a tension that neither resolves.
|
|
|
|
## Curator Notes (structured handoff for extractor)
|
|
PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow — complicates the proposed technical solution
|
|
WHY ARCHIVED: Critical counter to the leading sandbagging detection method. Creates a methodological double-bind that the previous session didn't identify.
|
|
EXTRACTION HINT: Archive alongside arXiv:2412.01784. The tension between these two papers is the core finding. The extractor should look for a divergence between them or a scoped claim that captures the double-bind.
|