| claim | ai-alignment | KTO adversarial training produces models with confession rates under 2 percent, on which white-box interpretability tools fail entirely while scaffolded black-box prompting remains effective | experimental | Abhay Sheshadri et al., AuditBench benchmark comparing detection effectiveness across varying levels of adversarial training | 2026-03-30 |
| extractor | sourcer |
| handle | context |
| anthropic-fellows-program | Abhay Sheshadri et al., AuditBench benchmark comparing detection effectiveness across varying levels of adversarial training |
| Eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods | related | 2026-04-06 |
| Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding | related | 2026-04-17 |
| Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers | related | 2026-04-21 |
|
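The multi-layer ensemble probe claim above can be illustrated with a minimal sketch. Everything here is assumed for illustration, not taken from the AuditBench work: synthetic activations with a deception direction that rotates gradually across layers, a least-squares linear probe per layer, and score averaging as the ensemble rule.

```python
# Minimal sketch, assuming: synthetic per-layer "hidden states", one
# least-squares linear probe per layer, and mean score aggregation.
# None of these choices come from the cited benchmark.
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_samples, d = 4, 200, 16

# The deception direction drifts slightly layer to layer ("rotates
# gradually"), which is why any single-layer probe sees only part of it.
directions = []
v = rng.normal(size=d)
for _ in range(n_layers):
    v = v + 0.3 * rng.normal(size=d)
    directions.append(v / np.linalg.norm(v))

labels = rng.integers(0, 2, size=n_samples)  # 1 = deceptive example
acts = [
    rng.normal(size=(n_samples, d)) + np.outer(labels, 2.0 * directions[l])
    for l in range(n_layers)
]

def fit_linear_probe(X, y):
    """Least-squares linear probe: score(x) = x @ w, trained on +/-1 targets."""
    w, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)
    return w

# One probe per layer; the ensemble averages their per-example scores.
probes = [fit_linear_probe(acts[l], labels) for l in range(n_layers)]
scores = np.mean([acts[l] @ probes[l] for l in range(n_layers)], axis=0)

def auroc(y, s):
    """AUROC = probability a random positive outscores a random negative."""
    pos, neg = s[y == 1], s[y == 0]
    return float(np.mean(pos[:, None] > neg[None, :]))

print(auroc(labels, scores))
```

On this synthetic data the averaged multi-layer scores separate the classes cleanly; a probe fit to a single layer captures only that layer's orientation of the drifting direction, which is the intuition behind the ensemble gain described in the claim.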