teleo-codex/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md
Teleo Agents ce04932b6e
substantive-fix: address reviewer feedback (confidence_miscalibration)
2026-04-22 02:48:36 +00:00

## Extending Evidence

**Source:** Theseus synthetic analysis

The 29-78% AUROC improvement is a clean-data accuracy result: Nordby et al. did not test adversarial robustness. Under white-box adversarial conditions (open-weights models), multi-layer ensembles are predicted to provide no structural protection against SCAV-generalization attacks. The improvement should therefore be read as applying only to non-adversarial monitoring contexts.
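
To make the scope of the claim concrete, the following is a minimal sketch of what a clean-data AUROC comparison between single-layer probes and a multi-layer ensemble measures. Everything in it is an assumption for illustration: the score values are hypothetical toy numbers, the ensemble rule (a plain mean of per-layer probe scores) is not confirmed to be the method used by Nordby et al., and the function names `auroc` and `ensemble_scores` are invented here. It does not reproduce the 29-78% figure, and it includes no adversary.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def ensemble_scores(per_layer_scores):
    """Combine per-layer probe scores by simple averaging (an assumed rule)."""
    return [sum(column) / len(column) for column in zip(*per_layer_scores)]


# Hypothetical toy data: 1 = behavior present, 0 = behavior absent.
labels = [1, 1, 1, 0, 0, 0]
layer_a = [0.9, 0.4, 0.8, 0.5, 0.2, 0.3]  # scores from a probe on one layer
layer_b = [0.3, 0.9, 0.7, 0.2, 0.6, 0.1]  # scores from a probe on another layer

single_a = auroc(layer_a, labels)
single_b = auroc(layer_b, labels)
combined = auroc(ensemble_scores([layer_a, layer_b]), labels)

print(f"layer A: {single_a:.3f}, layer B: {single_b:.3f}, ensemble: {combined:.3f}")
```

In this toy case each layer misranks a different example, so averaging separates the classes better than either probe alone. That gain depends on the inputs being unperturbed: nothing in the calculation models an attacker who, with white-box access, optimizes against all probed layers at once.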