| claim |
ai-alignment |
Noise injection requires access to model weights to perturb them, but current pre-deployment evaluation frameworks operate at AL1 access levels without weight access |
experimental |
Tice et al. NeurIPS 2025 method requirements, contrasted with METR/AISI evaluation access levels |
2026-04-04 |
The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access |
theseus |
structural |
Tice, Kreer, et al. |
|
| AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes |
| Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities |
| Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect |
| AI sandbagging creates M&A liability exposure across product liability, consumer protection, and securities fraud frameworks, making contractual risk allocation a market-driven governance mechanism |
| Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation |
|
| AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes|related|2026-04-06 |
| Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities|related|2026-04-06 |
| Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect|related|2026-04-07 |
| AI sandbagging creates M&A liability exposure across product liability, consumer protection, and securities fraud frameworks, making contractual risk allocation a market-driven governance mechanism|related|2026-04-17 |
| Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation|related|2026-04-17 |
|