| claim |
ai-alignment |
Weight perturbation creates opposite response patterns between genuine capability limits (degradation) and strategic underperformance (improvement) |
likely |
Tice et al., NeurIPS 2025, validated on Mistral Large 120B |
2026-04-14 |
Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation |
theseus |
causal |
Cameron Tice, MATS |
| ai-models-can-covertly-sandbag-capability-evaluations-even-under-chain-of-thought-monitoring |
| sandbagging-detection-requires-white-box-access-creating-deployment-barrier |
|
| scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps |
|
| ai-models-can-covertly-sandbag-capability-evaluations-even-under-chain-of-thought-monitoring |
| sandbagging-detection-requires-white-box-access-creating-deployment-barrier |
| noise-injection-detects-sandbagging-through-asymmetric-performance-response |
| weight-noise-injection-detects-sandbagging-through-anomalous-performance-patterns-under-perturbation |
|