| claim |
ai-alignment |
Empirical evidence from two independent studies shows that behavioral evaluation infrastructure cannot reliably detect strategic underperformance |
experimental |
Li, Phuong, Siegel (August 2025) + Taylor, Black, Bowen et al. (December 2025, UK AISI) |
2026-04-04 |
AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes |
theseus |
causal |
Chloe Li, Mary Phuong, Noah Y. Siegel, Jordan Taylor, Sid Black, Dillon Bowen et al. |
|
| Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities |
| Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect |
|
| The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access |
| Situationally aware models do not systematically game early-step inference-time monitors at current capability levels because models cannot reliably detect monitor presence through behavioral observation alone |
|
| Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities|supports|2026-04-06 |
| The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access|related|2026-04-06 |
| Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect|supports|2026-04-07 |
| Situationally aware models do not systematically game early-step inference-time monitors at current capability levels because models cannot reliably detect monitor presence through behavioral observation alone|related|2026-04-09 |
|