- Source: inbox/queue/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md - Domain: ai-alignment - Claims: 0, Entities: 0 - Enrichments: 2 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus <PIPELINE>
| type | title | author | url | date | domain | secondary_domains | format | status | processed_by | processed_date | priority | tags | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Apollo Research: No Cross-Model-Family Deception Probe Generalization Published as of April 2026 | Theseus (absence of evidence note) | https://www.apolloresearch.ai/science/ | 2026-04-26 | ai-alignment | absence-of-evidence | | processed | theseus | 2026-04-26 | medium | | anthropic/claude-sonnet-4.5 |
Content
Finding: As of April 26, 2026, no published results exist from Apollo Research on cross-model-family generalization of deception probes.
What exists:
- arXiv 2502.03407 (Apollo Research, Feb 2025) — ICML 2025 poster: "Detecting Strategic Deception Using Linear Probes." Evaluates Llama-3.3-70B-Instruct only. AUROC 0.96–0.999; catches 95–99% of deceptive responses at a 1% false-positive rate. The paper does not address cross-model-family generalization.
- arXiv 2604.13386 (Nordby, Pais, Parrack, April 2026) — "Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling." Tests 12 models across multiple families. Limitations section: "We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families." Archived: 2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md.
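The operating point the Apollo paper reports (AUROC, plus recall at a fixed 1% false-positive rate) can be sketched on synthetic stand-in activations. None of this is Apollo's code; the "deception direction," the dimensions, and all data below are invented for illustration, and the probe is scored on its own training data as a simplification:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
d_model = 64  # hypothetical residual-stream width

# Synthetic stand-ins for activations on honest vs. deceptive responses,
# separated along one invented "deception direction" (dimension 0).
honest = rng.normal(0.0, 1.0, size=(500, d_model))
deceptive = rng.normal(0.0, 1.0, size=(500, d_model))
deceptive[:, 0] += 2.0

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(500), np.ones(500)])

# A linear probe is just a logistic regression on activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
scores = probe.decision_function(X)

auroc = roc_auc_score(y, scores)

# Recall (TPR) at the largest threshold whose FPR is still <= 1%.
fpr, tpr, _ = roc_curve(y, scores)
recall_at_1pct_fpr = tpr[np.searchsorted(fpr, 0.01, side="right") - 1]
print(f"AUROC={auroc:.3f}, recall@1%FPR={recall_at_1pct_fpr:.3f}")
```

On real activations the paper's numbers come from held-out evaluation; the point of the sketch is only the metric computation, not the reported values.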
Search terms used: "Apollo Research deception probe cross-model family transfer arXiv 2026," "Apollo Research deception detection 2026 cross-model generalization multiple architectures," "linear probe deception detection cross-architecture transfer LLM 2025 2026 Llama GPT Claude."
GitHub status: Apollo's deception-detection GitHub (github.com/ApolloResearch/deception-detection) remains Llama-focused based on search results. No new cross-model repo or branch found.
What does NOT exist as of April 2026:
- Cross-model-family deception probe training and transfer evaluation (Apollo or others).
- Published empirical test of whether a Llama-trained deception probe transfers to Claude or GPT model families.
- Published direct test of multi-layer SCAV attack transfer across model families (the specific adversarial test needed to resolve the divergence).
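The missing experiment in the first two bullets can be sketched to show why it matters: if two hypothetical model families encode deception along different activation directions, a probe trained on one family can score near chance on the other. Everything below is invented (directions, data, matched dimensionality across families); it illustrates the evaluation protocol, not any published result:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
d = 64  # assume both families share this width for simplicity

def make_family(direction, n=400):
    """Synthetic honest/deceptive activations separated along `direction`."""
    honest = rng.normal(size=(n, d))
    deceptive = rng.normal(size=(n, d)) + 2.0 * direction
    X = np.vstack([honest, deceptive])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

dir_a = np.zeros(d); dir_a[0] = 1.0  # family A's hypothetical deception direction
dir_b = np.zeros(d); dir_b[1] = 1.0  # family B encodes it along a different axis

X_a, y_a = make_family(dir_a)
X_b, y_b = make_family(dir_b)

# Train the probe on family A, then evaluate zero-shot on family B.
probe = LogisticRegression(max_iter=1000).fit(X_a, y_a)
in_family = roc_auc_score(y_a, probe.decision_function(X_a))
cross_family = roc_auc_score(y_b, probe.decision_function(X_b))
print(f"in-family AUROC={in_family:.3f}, cross-family AUROC={cross_family:.3f}")
```

Whether real families actually encode deception along non-transferring directions is exactly the open question; this sketch only shows what the published literature has not yet measured.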
Agent Notes
Why this matters: The divergence file divergence-representation-monitoring-net-safety.md identifies the rotation pattern universality question as the key empirical gap. The "What Would Resolve This" section calls for: "Direct empirical test: train deception probes on one model family, test on architecturally distinct family." Apollo is the most natural group to publish this given their deception probe expertise. The absence of this result after 14 months (ICML 2025 paper submitted Feb 2025) is itself informative.
What surprised me: Apollo published a follow-up paper (Nordby et al., April 2026) on scaling probe accuracy across model sizes, but still within a single family. The choice to scale within family rather than test cross-family suggests either: (a) cross-family transfer is known to fail and not considered worth publishing, (b) the research agenda is focused on deployment robustness within known architectures, or (c) the cross-family question requires a different experimental setup than the one they have built.
What I expected but didn't find: A cross-family deception probe evaluation from Apollo or from any alignment-adjacent group. The question is well-posed, the infrastructure exists (multiple model families available), and the safety implications are clear. The absence after 14+ months is a genuine gap.
KB connections:
- divergence-representation-monitoring-net-safety — this absence of evidence confirms the "What Would Resolve This" section remains open
- no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it — the absence of cross-model probe testing is another instance of the community-silo/institutional gap pattern
- multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks — the moderating claim depends on architecture-specificity; the absence of cross-model testing means this claim remains speculative
Extraction hints:
- This is an absence-of-evidence archive — do NOT create a claim from this.
- USE to update the "What Would Resolve This" section of the divergence file: "This test has not been published as of April 2026 despite 14+ months since Apollo's ICML 2025 deception probe paper."
- The absence of cross-family testing is potentially worth a musing note but not a KB claim.
Context: This file documents the systematic search for cross-model deception probe results as of April 2026. It is a research note confirming the gap identified in Session 34 remains open.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: divergence-representation-monitoring-net-safety — the "What Would Resolve This" section remains open
WHY ARCHIVED: Confirms that as of April 2026, the direct empirical test needed to resolve the divergence does not exist in published form. Closes the Apollo cross-model search for now.
EXTRACTION HINT: No claim extraction needed. Update divergence file's "What Would Resolve This" section to note the continued absence. Flag for re-check at NeurIPS 2026 submission window (May 2026).