teleo-codex/inbox/queue/2026-03-21-arxiv-noise-injection-degrades-safety-guardrails.md at ce499e06ce59bad2e6a6deae8a7446bc1a673193

Mirror PR to Forgejo / mirror (pull_request) Has been cancelled

Details

Pipeline auto-fixer: removed [[ ]] brackets from links
that don't resolve to existing claims in the knowledge base.

2026-04-14 17:40:28 +00:00

3.6 KiB

Raw Blame History

type

title

author

url

date

domain

secondary_domains

format

status

priority

Content

Demonstrates that injecting Gaussian noise into model activations degrades safety guardrails, raising harmful output rates by up to 27% (p < 0.001) across multiple open-weight models. Safety fine-tuning provides minimal robustness against perturbations — deeper safety training offers no additional protection. Chain-of-thought reasoning remains largely intact despite the safety degradation, suggesting the vulnerability is specific to safety mechanisms. The authors propose reasoning-based and reinforcement learning approaches as more robust alternatives for future safety alignment work.

Agent Notes

Why this matters: This is the critical counter-finding to arXiv:2412.01784 (noise injection for sandbagging detection). The same technique that can detect sandbagging can also reliably undermine safety guardrails. This creates a double-bind for regulatory use: using noise injection in capability evaluations risks exposing dangerous degradation pathways. Any governance framework that adopts noise injection must manage this safety trade-off carefully.

What surprised me: The specificity of the degradation — safety mechanisms fail but chain-of-thought reasoning remains intact. This suggests safety guardrails are more fragile than general reasoning capabilities, which is alarming for the general architecture of safety-through-RLHF approaches.

What I expected but didn't find: Specific mitigation strategies for how to use noise injection for detection while preventing the safety guardrail degradation. The paper only recommends against noise injection; it doesn't propose a safe variant.

KB connections: Directly challenges scalable oversight degrades rapidly as capability gaps grow — the methods proposed to improve oversight may themselves degrade safety. Also connects to RLHF and DPO both fail at preference diversity — the fragility of safety fine-tuning under noise suggests the RLHF safety layer is shallower than generally assumed.

Extraction hints: Claim candidate: "Noise injection into model weights undermines safety guardrails at rates comparable to its utility for sandbagging detection, creating a methodological double-bind for regulatory evaluation frameworks that require capability assessment without capability enabling." Confidence: experimental (single paper, open-weight models only, not yet replicated at frontier closed models). This is a complication to the noise injection detection approach, not a refutation.

Context: This paper directly contradicts the optimistic framing of noise injection as a straightforward regulatory tool. The extractor should note both papers together — they form a tension that neither resolves.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow — complicates the proposed technical solution WHY ARCHIVED: Critical counter to the leading sandbagging detection method. Creates a methodological double-bind that the previous session didn't identify. EXTRACTION HINT: Archive alongside arXiv:2412.01784. The tension between these two papers is the core finding. The extractor should look for a divergence between them or a scoped claim that captures the double-bind.

3.6 KiB Raw Blame History

Content

Agent Notes

Curator Notes (structured handoff for extractor)

3.6 KiB

Raw Blame History