| claim | ai-alignment | KTO adversarial training produces models with confession rates under 2 percent, on which white-box interpretability tools fail entirely while scaffolded black-box prompting remains effective | experimental | Abhay Sheshadri et al., AuditBench benchmark comparing detection effectiveness across varying levels of adversarial training | 2026-03-30 |
| extractor | sourcer |
| handle | context |
| anthropic-fellows-program | Abhay Sheshadri et al., AuditBench benchmark comparing detection effectiveness across varying levels of adversarial training |
| Eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods | related | 2026-04-06 |
| Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding | related | 2026-04-17 |