teleo-codex/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md at ce04932b6ee9717baab9ee75248f7308a5e5ab49


Teleo Agents ce04932b6e


substantive-fix: address reviewer feedback (confidence_miscalibration)

2026-04-22 02:48:36 +00:00


## Extending Evidence

**Source:** Theseus synthetic analysis

The 29-78% AUROC improvement is a clean-data accuracy result: Nordby et al. did not test adversarial robustness. Under white-box adversarial conditions (open-weights models), multi-layer ensembles are predicted to provide no structural protection against SCAV-generalization attacks. The improvement therefore applies only to non-adversarial monitoring contexts.
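
To make the scope of a clean-data AUROC comparison concrete, the following is a minimal sketch on synthetic data. It is not the Nordby et al. setup and does not reproduce the 29-78% figure. The helper names (`auroc`, `make_layer_scores`), the Gaussian noise model, the three-layer count, and score averaging as the ensemble rule are all assumptions made for illustration. The sketch only shows how an ensemble of independent, noisy per-layer probe scores can score a higher AUROC than any single layer when no adversary is present.

```python
import random


def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as 0.5 (the Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def make_layer_scores(labels, rng, signal=1.0, noise=1.0):
    """Hypothetical probe output for one layer: label signal plus
    independent Gaussian noise. Purely synthetic, clean (non-adversarial) data."""
    return [signal * y + rng.gauss(0.0, noise) for y in labels]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
labels = [1] * 400 + [0] * 400

# Assumption: three layers with equally informative, independently noisy probes.
layer_scores = [make_layer_scores(labels, rng) for _ in range(3)]
single_aucs = [auroc(s, labels) for s in layer_scores]

# Assumed ensemble rule: average the per-layer scores for each example.
ensemble_scores = [sum(col) / len(col) for col in zip(*layer_scores)]
ensemble_auc = auroc(ensemble_scores, labels)

print("single-layer AUROCs:", [round(a, 3) for a in single_aucs])
print("ensemble AUROC:", round(ensemble_auc, 3))
```

Nothing in this sketch models an attacker. If an adversary with white-box access can shift every layer's score together, the independence that averaging relies on no longer holds, which is consistent with the prediction that the clean-data gain does not carry over to adversarial settings.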
