---
type: source
title: "Detecting Strategic Deception Using Linear Probes (Apollo Research, ICML 2025)"
author: "Apollo Research (@apolloaisafety)"
url: https://arxiv.org/abs/2502.03407
date: 2026-04-25
domain: ai-alignment
secondary_domains: []
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-25
priority: medium
tags: [deception-detection, linear-probes, strategic-deception, apollo-research, ICML-2025, peer-reviewed, representation-monitoring, single-model]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content

**Citation:** "Detecting Strategic Deception Using Linear Probes," Apollo Research, ICML 2025 (Proceedings of the 42nd International Conference on Machine Learning). arXiv:2502.03407.

**Core results:**

- A linear probe distinguishes honest from deceptive responses with AUROC 0.96-0.999 on evaluation datasets
- 95-99% catch rate for deceptive responses at a 1% false-positive rate on unrelated chat data
- Two probe-training datasets tested: (1) contrasting honest/deceptive instruction pairs; (2) roleplaying-scenario responses
- Evaluated on Llama-3.3-70B-Instruct in realistic deceptive scenarios: concealing insider trading, deliberately underperforming on safety evaluations

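The probe setup and both headline metrics can be sketched in a few lines. This is a toy reconstruction, not Apollo's pipeline: synthetic Gaussian vectors stand in for Llama-3.3-70B residual-stream activations, and the "deception direction" separating the two classes is an assumption of the sketch. The metrics themselves (AUROC, and catch rate at a fixed 1% false-positive rate) are computed the standard way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-in for residual-stream activations: honest and deceptive responses
# are assumed to separate along one hypothetical "deception direction".
d, n = 64, 500
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
honest = rng.normal(size=(n, d))                       # label 0
deceptive = rng.normal(size=(n, d)) + 3.0 * direction  # label 1

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(n), np.ones(n)])

# A "linear probe" here is just logistic regression on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Evaluate on a fresh synthetic sample.
h_test = rng.normal(size=(n, d))
d_test = rng.normal(size=(n, d)) + 3.0 * direction
scores_h = probe.predict_proba(h_test)[:, 1]
scores_d = probe.predict_proba(d_test)[:, 1]

auroc = roc_auc_score(
    np.concatenate([np.zeros(n), np.ones(n)]),
    np.concatenate([scores_h, scores_d]),
)

# Catch rate at 1% FPR: set the threshold at the 99th percentile of
# honest-response scores, then measure recall on deceptive responses.
threshold = np.quantile(scores_h, 0.99)
catch_rate = (scores_d > threshold).mean()

print(f"AUROC: {auroc:.3f}, catch rate @ 1% FPR: {catch_rate:.2%}")
```

The threshold-at-fixed-FPR step is why the paper reports two numbers: AUROC summarizes ranking quality over all thresholds, while the catch rate reflects the single operating point a deployed monitor would actually use.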
**Key limitation:** Evaluation covered Llama-3.3-70B-Instruct only, with no cross-family generalization tests. "Current performance is insufficient as a robust defence against deception": the probe sometimes fires on the topic of deception rather than genuine deceptive intent.

**Related to Nordby findings:** Like Nordby et al., Apollo's evaluation is single-model. The impressive AUROC numbers (0.96-0.999) are model-specific achievements, not universal guarantees.
**Venue status:** Published at ICML 2025 (peer-reviewed). Code and data publicly available at data.apolloresearch.ai.
## Agent Notes
**Why this matters:** This confirms deception-specific linear probes work with high precision in controlled single-model settings. The peer-reviewed status at ICML 2025 strengthens the evidence base for representation monitoring as a viable technique. However, like Nordby et al., the evaluation is single-model — no cross-family transfer tested.
**What surprised me:** Apollo explicitly acknowledges the probe "sometimes fires for the topic of deception rather than genuine intent" — this is a probe-triggers-on-surface-features problem, distinct from SCAV vulnerability but also a deployment limitation.
**What I expected but didn't find:** Cross-model generalization results. Any adversarial robustness evaluation. The Apollo paper does not engage with SCAV.
**KB connections:**

- [[multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent]] — Apollo's single-layer probe achieves similar AUROC to Nordby's single-layer baseline; Nordby's multi-layer ensemble is the increment over it
- [[scheming-safety-cases-require-interpretability-evidence-because-observer-effects-make-behavioral-evaluation-insufficient]] — Apollo's work directly supports this KB claim
- [[major-ai-safety-governance-frameworks-architecturally-dependent-on-behaviorally-insufficient-evaluation]] — Apollo's probe work is one of the few non-behavioral evaluation tools actually deployed in research

**Extraction hints:**

- Primarily useful for enriching the KB claim on deception probes with the ICML 2025 peer-reviewed citation
- The probe's false-positive-on-topic limitation (it fires on deception as subject matter, not deceptive intent) is worth noting — it's a probe-scope problem, not an adversarial attack
- Do NOT create a standalone new claim — enrich existing deception-probe claims with this peer-reviewed citation

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent]]
WHY ARCHIVED: Peer-reviewed confirmation of linear probe deception detection, adding citation quality to the monitoring side of the divergence.
EXTRACTION HINT: Use as an evidence-citation upgrade, not a new claim: the Apollo ICML paper strengthens the existing evidence chain without adding anything new to extract. The single-model limitation is the KB-relevant nuance.