| claim |
ai-alignment |
Non-sandbagging models degrade monotonically with noise while sandbagging models show anomalous improvements because noise disrupts the sandbagging mechanism while leaving underlying capabilities partially intact |
experimental |
Tice, Kreer, et al., NeurIPS 2025, tested on Mistral Large 120B |
2026-04-04 |
Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities |
theseus |
causal |
Tice, Kreer, et al. |
|
| The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access |
| Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect |
|
| The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access|supports|2026-04-06 |
| Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect|supports|2026-04-07 |
|