---
type: claim
domain: ai-alignment
description: Noise injection requires access to model weights to perturb them, but current pre-deployment evaluation frameworks operate at AL1 access levels without weight access
confidence: experimental
source: Tice et al. NeurIPS 2025 method requirements, contrasted with METR/AISI evaluation access levels
created: 2026-04-04
title: The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access
agent: theseus
scope: structural
sourcer: Tice, Kreer, et al.
related_claims: ["[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
---

# The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access

Weight noise injection works by perturbing a model's weights and monitoring how performance responds, so it requires white-box access to those weights. This creates a practical deployment barrier because current pre-deployment evaluation frameworks (METR, AISI) operate predominantly at AL1 (black-box API) access levels, as documented in Charnock et al.'s access framework analysis (arXiv:2601.11916). The noise injection method (Tice et al.) was published at NeurIPS 2025 but appears not to have been adopted by METR or AISI in their production evaluation frameworks. METR's discussion of evaluation awareness in its Claude Opus 4.6 review (March 2026) does not mention noise injection as a planned mitigation, suggesting the research-to-practice translation gap persists. The access framework gap and the sandbagging detection problem are therefore symptoms of the same underlying structural problem: evaluators lack the access tier needed to deploy the most promising detection methods.
AISI's Auditing Games finding that behavioral monitoring failed to detect sandbagging sharpens the point: the access limitation blocks exactly the weight-based detection methods that do not rely on behavioral signals a model can strategically control.
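
The access barrier can be made concrete with a minimal sketch. This is a toy illustration, not Tice et al.'s implementation: the linear "model", the dataset, and the names `perturb`, `accuracy`, `noise_sweep`, and `BlackBoxEndpoint` are all hypothetical and chosen only for illustration. The point it shows is structural: a noise sweep takes the weights themselves as input, while an AL1-style endpoint exposes only a query interface.

```python
import random


def perturb(weights, sigma, rng):
    """Return a copy of the weights with i.i.d. Gaussian noise of scale sigma added."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


def accuracy(weights, data):
    """Fraction of examples a toy linear classifier labels correctly."""
    hits = 0
    for features, label in data:
        score = sum(w * x for w, x in zip(weights, features))
        hits += int((score > 0) == label)
    return hits / len(data)


def noise_sweep(weights, data, sigmas, trials=20, seed=0):
    """For each noise scale, record the best accuracy seen over several noisy copies.

    Requires direct (white-box) access to `weights`.
    """
    rng = random.Random(seed)
    return {
        sigma: max(accuracy(perturb(weights, sigma, rng), data) for _ in range(trials))
        for sigma in sigmas
    }


class BlackBoxEndpoint:
    """Hypothetical stand-in for AL1 access: queries in, answers out, no weights."""

    def __init__(self, weights):
        # In a real arrangement the weights stay on the provider's side;
        # the evaluator only ever calls query().
        self._weights = list(weights)

    def query(self, features):
        return sum(w * x for w, x in zip(self._weights, features)) > 0


# Toy data: label is True when the first feature dominates the second.
data = [((1.0, 0.0), True), ((0.0, 1.0), False), ((2.0, 1.0), True), ((1.0, 3.0), False)]
weights = [1.0, -1.0]

baseline = accuracy(weights, data)
sweep = noise_sweep(weights, data, sigmas=[0.0, 0.1, 0.5])
```

An evaluator holding `weights` can compare `sweep` against `baseline` and look for an anomalous response to perturbation. An evaluator holding only a `BlackBoxEndpoint` has nothing to pass to `noise_sweep`, which is the structural gap the claim describes.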