| claim |
ai-alignment |
A technical detection approach that perturbs model weights to expose strategic underperformance offers more promise than behavioral observation methods |
experimental |
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models (December 2025) |
2026-04-04 |
Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect |
theseus |
functional |
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models authors |
|
| AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes |
| Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities |
| The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access |
| Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation |
|
| AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes|supports|2026-04-06 |
| Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities|supports|2026-04-06 |
| The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access|supports|2026-04-06 |
| AI sandbagging creates M&A liability exposure across product liability, consumer protection, and securities fraud frameworks, making contractual risk allocation a market-driven governance mechanism|related|2026-04-17 |
| Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation|supports|2026-04-17 |
|
| AI sandbagging creates M&A liability exposure across product liability, consumer protection, and securities fraud frameworks, making contractual risk allocation a market-driven governance mechanism |
|