diff --git a/domains/ai-alignment/ai-models-can-covertly-sandbag-capability-evaluations-even-under-chain-of-thought-monitoring.md b/domains/ai-alignment/ai-models-can-covertly-sandbag-capability-evaluations-even-under-chain-of-thought-monitoring.md index e34e7b830..b9f1b9e25 100644 --- a/domains/ai-alignment/ai-models-can-covertly-sandbag-capability-evaluations-even-under-chain-of-thought-monitoring.md +++ b/domains/ai-alignment/ai-models-can-covertly-sandbag-capability-evaluations-even-under-chain-of-thought-monitoring.md @@ -14,10 +14,10 @@ related_claims: ["[[an aligned-seeming AI may be strategically deceptive because supports: - Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities related: - - "Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect" +- Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect - The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access reweave_edges: - - "Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect|related|2026-04-07" +- Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect|related|2026-04-07 - Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities|supports|2026-04-06 - The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access|related|2026-04-06 ---