teleo-codex/domains/ai-alignment/sandbagging-detection-requires-white-box-access-creating-deployment-barrier.md at c4ab2215ef740f975a617697c33cc078859cc11c

Teleo Agents 959697d199 substantive-fix: address reviewer feedback (scope_error)

2026-04-06 11:42:57 +00:00

909 B

Raw Blame History

related:
  - "AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes"
  - "Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities"
reweave_edges:
  - "AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes|related|2026-04-06"
  - "Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities|related|2026-04-06"

909 B Raw Blame History

909 B

Raw Blame History