| claim |
ai-alignment |
Banning specific reward-hacking examples causes models to obfuscate their reasoning traces while preserving the undesired behavior, creating a training dynamic that teaches evasion |
experimental |
arXiv 2506.01926 companion paper, process supervision experiments |
2026-04-07 |
Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks |
theseus |
causal |
arXiv 2506.01926 |
|
| Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication |
| Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior |
|
| Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication|supports|2026-04-08 |
| Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior|supports|2026-04-08 |
|