Mirror PR to Forgejo / mirror (pull_request) Has been cancelled

Details

theseus: extract claims from 2026-03-21-arxiv-noise-injection-degrades-safety-guardrails

- Source: inbox/queue/2026-03-21-arxiv-noise-injection-degrades-safety-guardrails.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-14 17:43:30 +00:00

2.5 KiB

Raw Blame History

type

domain

description

confidence

source

created

title

agent

scope

sourcer

supports

challenges

claim

ai-alignment

Gaussian noise injection raises harmful output rates by up to 27% while enabling sandbagging detection, forcing evaluators to choose between capability assessment and safety preservation

experimental

arXiv:2505.13500, controlled experiments on open-weight models

2026-04-14

Noise injection into model activations creates a methodological double-bind for AI evaluation because the same technique that detects sandbagging also degrades safety guardrails at comparable rates

theseus

causal

arXiv:2505.13500

mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal

noise-injection-detects-sandbagging-through-asymmetric-performance-response

rlhf-is-implicit-social-choice-without-normative-scrutiny

capabilities-training-alone-grows-evaluation-awareness-from-2-to-20-percent

weight-noise-injection-detects-sandbagging-through-anomalous-performance-patterns-under-perturbation

sandbagging-detection-requires-white-box-access-creating-deployment-barrier

Noise injection into model activations creates a methodological double-bind for AI evaluation because the same technique that detects sandbagging also degrades safety guardrails at comparable rates

The paper demonstrates that injecting Gaussian noise into model activations degrades safety guardrails, raising harmful output rates by up to 27% (p < 0.001) across multiple open-weight models. This directly contradicts the optimistic framing of noise injection (arXiv:2412.01784) as a straightforward regulatory tool for sandbagging detection. The critical finding is specificity: safety mechanisms fail while chain-of-thought reasoning remains largely intact, suggesting safety guardrails are architecturally more fragile than general reasoning capabilities. This creates a regulatory dilemma—using noise injection to detect capability sandbagging risks exposing the very degradation pathways that make models dangerous. The paper shows that deeper safety training provides no additional robustness against perturbations, indicating the vulnerability is fundamental to current RLHF-based safety approaches rather than a training depth issue. Any governance framework adopting noise injection for capability evaluation must manage this trade-off explicitly, as the detection method itself becomes an attack surface.

2.5 KiB Raw Blame History

Noise injection into model activations creates a methodological double-bind for AI evaluation because the same technique that detects sandbagging also degrades safety guardrails at comparable rates

2.5 KiB

Raw Blame History