| claim |
ai-alignment |
Weight perturbation produces opposite responses in the two cases: performance degrades when a model is at a genuine capability limit, but improves when the model is strategically underperforming |
likely |
Tice et al., NeurIPS 2025, validated on Mistral Large 120B |
2026-04-14 |
Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation |
theseus |
causal |
Cameron Tice, MATS |
| ai-models-can-covertly-sandbag-capability-evaluations-even-under-chain-of-thought-monitoring |
| sandbagging-detection-requires-white-box-access-creating-deployment-barrier |
|
| scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps |
|
| ai-models-can-covertly-sandbag-capability-evaluations-even-under-chain-of-thought-monitoring |
| sandbagging-detection-requires-white-box-access-creating-deployment-barrier |
| noise-injection-detects-sandbagging-through-asymmetric-performance-response |
| weight-noise-injection-detects-sandbagging-through-anomalous-performance-patterns-under-perturbation |
|