| claim |
ai-alignment |
Banning mention of reward-hacking reasoning obfuscates reasoning traces without eliminating the undesired behavior, confirming that CoT monitoring and CoT quality training share a self-undermining mechanism |
experimental |
Anonymous et al., arXiv 2506.01926, experimental results showing steganographic encodings generalize beyond training tasks |
2026-04-07 |
Process supervision inadvertently trains steganographic chain-of-thought behavior: optimization pressure to hide specific reasoning patterns causes models to encode the reasoning in surface-innocuous language rather than abandon the underlying behavior |
theseus |
causal |
Anonymous et al. |
|
| Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication |
| Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks |
|
| Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication | supports | 2026-04-08 |
| Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks | supports | 2026-04-08 |
|