teleo-codex/inbox/queue/2026-04-25-apollo-detecting-strategic-deception-icml-2025.md
2026-04-25 00:14:25 +00:00


type: source
title: Detecting Strategic Deception Using Linear Probes (Apollo Research, ICML 2025)
author: Apollo Research (@apolloaisafety)
url: https://arxiv.org/abs/2502.03407
date: 2026-04-25
domain: ai-alignment
secondary_domains:
format: paper
status: unprocessed
priority: medium
tags: deception-detection, linear-probes, strategic-deception, apollo-research, ICML-2025, peer-reviewed, representation-monitoring, single-model

Content

Citation: "Detecting Strategic Deception Using Linear Probes," Apollo Research, ICML 2025 (Proceedings of the 42nd International Conference on Machine Learning). arXiv 2502.03407.

Core results:

  • Linear probe distinguishes honest from deceptive responses with AUROC 0.96-0.999 on evaluation datasets
  • 95-99% catch rate of deceptive responses at 1% false positive rate on unrelated chat data
  • Two probe-training datasets tested: (1) contrasting honest/deceptive instruction pairs; (2) roleplaying scenario responses
  • Evaluated on Llama-3.3-70B-Instruct in realistic deceptive scenarios: concealing insider trading, purposely underperforming on safety evaluations
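The headline metrics above (AUROC, catch rate at a 1% false positive rate) can be made concrete with a minimal sketch: a difference-of-means linear probe scored on synthetic stand-in "activations". The paper probes Llama-3.3-70B-Instruct internals; the dimensions, class separation, and sample counts below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations: honest and
# deceptive examples differ in mean along one "deception direction".
d = 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
honest = rng.normal(size=(500, d))
deceptive = rng.normal(size=(500, d)) + 3.0 * direction

# Difference-of-means linear probe: score an activation by its
# projection onto the difference of class means.
w = deceptive.mean(axis=0) - honest.mean(axis=0)
scores = np.concatenate([honest @ w, deceptive @ w])
labels = np.concatenate([np.zeros(500), np.ones(500)])

def auroc(scores, labels):
    """P(random deceptive score > random honest score); ties count half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def catch_rate_at_fpr(scores, labels, fpr=0.01):
    """Recall on deceptive examples with the threshold set so that
    only `fpr` of honest examples score above it."""
    threshold = np.quantile(scores[labels == 0], 1.0 - fpr)
    return (scores[labels == 1] > threshold).mean()

print(f"AUROC: {auroc(scores, labels):.3f}")
print(f"catch rate at 1% FPR: {catch_rate_at_fpr(scores, labels):.3f}")
```

Note that on this toy data a high AUROC coexists with a noticeably lower catch rate at the strict 1% FPR threshold, which is why the paper reports both numbers.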

Key limitation: Evaluation covers Llama-3.3-70B-Instruct only; no cross-family generalization tests. "Current performance is insufficient as a robust defence against deception": the probe sometimes fires on the topic of deception rather than genuine deceptive intent.
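The topic-versus-intent failure mode amounts to a control-set evaluation: score honest text that merely *discusses* deception and compare its false positive rate to the nominal 1%. A sketch with invented score distributions (hypothetical numbers, not Apollo's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical probe scores: honest off-topic text scores low,
# deceptive text scores high, and honest text *about* deception
# sits in between -- the surface-feature failure mode.
honest_offtopic = rng.normal(0.0, 1.0, size=1000)
deceptive = rng.normal(4.0, 1.0, size=1000)
honest_about_deception = rng.normal(2.0, 1.0, size=1000)

# Threshold calibrated for a 1% FPR on off-topic honest text.
threshold = np.quantile(honest_offtopic, 0.99)

catch_rate = (deceptive > threshold).mean()
topic_fpr = (honest_about_deception > threshold).mean()

print(f"catch rate: {catch_rate:.2f}")
print(f"FPR on honest deception-topic text: {topic_fpr:.2f}")
```

Under these assumed distributions the FPR on deception-topic text far exceeds the 1% the threshold was calibrated for, which is the deployment concern the quote raises.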

Related to Nordby findings: Like Nordby et al., Apollo's evaluation is single-model. The impressive AUROC numbers (0.96-0.999) are model-specific achievements, not universal guarantees.

Venue status: Published at ICML 2025 (peer-reviewed). Code and data publicly available at data.apolloresearch.ai.

Agent Notes

Why this matters: This confirms deception-specific linear probes work with high precision in controlled single-model settings. The peer-reviewed status at ICML 2025 strengthens the evidence base for representation monitoring as a viable technique. However, like Nordby et al., the evaluation is single-model — no cross-family transfer tested.

What surprised me: Apollo explicitly acknowledges the probe "sometimes fires for the topic of deception rather than genuine intent" — this is a probe-triggers-on-surface-features problem, distinct from SCAV vulnerability but also a deployment limitation.

What I expected but didn't find: Cross-model generalization results. Any adversarial robustness evaluation. The Apollo paper does not engage with SCAV.

KB connections:

Extraction hints:

  • Primarily useful for enriching the KB claim on deception probes with the ICML 2025 peer-reviewed citation
  • The probe's false-positive-on-topic limitation (fires on deception as subject matter, not intent) is worth noting — it's a probe scope problem, not an adversarial attack
  • Do NOT create a standalone new claim — enrich existing deception probe claims with this peer-reviewed citation

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent

WHY ARCHIVED: Peer-reviewed confirmation of linear probe deception detection, adding citation quality to the monitoring side of the divergence.

EXTRACTION HINT: Use as an evidence citation upgrade, not a new claim. The Apollo ICML paper strengthens the existing evidence chain but doesn't add a new claim. The single-model limitation is the KB-relevant nuance.