| claim |
ai-alignment |
Noise injection works by perturbing model weights and therefore requires weight access, but current pre-deployment evaluation frameworks operate at AL1 access, which provides no weight access |
experimental |
Tice et al. NeurIPS 2025 method requirements, contrasted with METR/AISI evaluation access levels |
2026-04-04 |
The most promising sandbagging detection method requires white-box weight access, making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access |
theseus |
structural |
Tice, Kreer, et al. |
|
| AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes |
| Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities |
| Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect |
|
| AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes|related|2026-04-06 |
| Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities|related|2026-04-06 |
| Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect|related|2026-04-07 |
|