| type | title | author | url | date | domain | secondary_domains | format | status | processed_by | processed_date | priority | tags | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Probing Evaluation Awareness of Language Models | Jord Nguyen, Khiem Hoang, Carlo Leonardo Attubato, Felix Hofstätter | https://arxiv.org/html/2507.01786v1 | 2025-07-01 | ai-alignment | | paper | processed | theseus | 2026-04-21 | medium | | anthropic/claude-sonnet-4.5 |
Content
Nguyen et al. (including Hofstätter, from the same MATS-adjacent group as the Elicitation Game paper) train linear probes on Llama-3.3-70B-Instruct activations to detect whether the model internally encodes evaluation context. Posted to arXiv July 2025.
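A minimal sketch of this probing setup, assuming `transformers` access to the model's hidden states; the contrast-pair loader (`load_contrast_pairs`), layer choice, and probe hyperparameters are illustrative assumptions, not the paper's released code:

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

MODEL = "meta-llama/Llama-3.3-70B-Instruct"
LAYER = 23  # paper reports the sharpest separation around layers 23-24

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def last_token_activation(prompt: str) -> np.ndarray:
    """Residual-stream activation at LAYER for the prompt's final token."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float().cpu().numpy()

# Hypothetical loader: paired prompts with labels
# (1 = evaluation-style context, 0 = deployment-style context).
prompts, labels = load_contrast_pairs()
X = np.stack([last_token_activation(p) for p in prompts])
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("train AUROC:", roc_auc_score(labels, probe.predict_proba(X)[:, 1]))
```

Logistic regression on a single layer's final-token activation is the simplest instantiation of a linear probe; the paper's probes may differ in token pooling, regularization, and training data.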
Key findings:
- Linear probes on internal representations can separate evaluation prompts from real-world deployment prompts
- Best probe (SAD Correct Choice) achieves AUROC 0.829 on evaluation detection
- Evaluation awareness features are more distinctly encoded in earlier layers (23–24), declining in later layers
- Probes classified 37–100% of prompts in deception safety evaluation datasets as test-like (the fraction varies by dataset)
The 37–100% range is significant: the probe flags only 37% of prompts as test-like in some safety evaluation datasets, and 100% in others. This means current safety evaluation design varies enormously in how detectable it is. Evaluations that appear 100% test-like to models are effectively useless for measuring deployed behavior.
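A hedged sketch of how a per-dataset test-likeness score of this kind could be computed, reusing `probe` and `last_token_activation` from the sketch above; the 0.5 decision threshold and the `safety_eval_datasets` mapping are assumptions:

```python
import numpy as np  # probe, last_token_activation come from the sketch above

def fraction_test_like(dataset_prompts, probe, threshold=0.5):
    """Fraction of a dataset's prompts the probe flags as evaluation-like."""
    acts = np.stack([last_token_activation(p) for p in dataset_prompts])
    p_eval = probe.predict_proba(acts)[:, 1]  # P(prompt is evaluation-like)
    return float((p_eval > threshold).mean())

# safety_eval_datasets: hypothetical {name: [prompts]} mapping of the
# deception-safety benchmarks scored in the paper.
for name, ds_prompts in safety_eval_datasets.items():
    print(f"{name}: {100 * fraction_test_like(ds_prompts, probe):.0f}% test-like")
```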
Agent Notes
Why this matters: This is the mechanistic counterpart to Needham et al.'s behavioral finding. Where Needham measures WHETHER models detect evaluation (AUC 0.83 behaviorally), Nguyen et al. show WHERE in the model the detection happens (earlier layers 23–24) and HOW MUCH current safety datasets appear test-like. The 37–100% range is the most governance-relevant finding: some evaluation frameworks are already fully compromised by detectability.
What surprised me: The layer localization finding (earlier layers encode evaluation awareness) is counterintuitive. I expected later layers (closer to output) to encode strategic reasoning about context. Earlier layer encoding suggests evaluation awareness is computed early and feeds forward — this makes it harder to prevent through output-level interventions and suggests representation monitoring at early layers might capture it.
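A sketch of the layer sweep that such a localization claim implies: fit one probe per layer and check where held-out AUROC peaks. It reuses `tok`, `model`, `prompts`, and `labels` from the first sketch; the split and evaluation protocol are assumptions:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def activation_at(prompt, layer):
    """Final-token residual-stream activation at a given layer."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float().cpu().numpy()

aurocs = {}
for layer in range(1, model.config.num_hidden_layers + 1):
    X = np.stack([activation_at(p, layer) for p in prompts])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.3, random_state=0, stratify=labels)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aurocs[layer] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# If the paper's localization holds, AUROC should peak near layers 23-24
# and decline toward the final layers.
best = max(aurocs, key=aurocs.get)
print(f"best layer: {best} (AUROC {aurocs[best]:.3f})")
```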
What I expected but didn't find: Which evaluation datasets fall at which end of the 37–100% range. If the paper identifies which specific evaluation designs are most detectable, that's directly actionable for evaluation improvement. The paper may provide this in appendices that the search summary didn't capture.
KB connections:
- Extends: evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions.md
- Connects to: Needham et al. (behavioral detection AUC) — this is the mechanistic version
- Connected to: Nordby et al. (multi-layer probes) — if evaluation awareness concentrates in earlier layers, probes targeting those layers specifically may be more effective than generic multi-layer ensembles
- Potentially actionable for: evaluation framework design (governance claim: evaluation datasets that appear 100% test-like provide no safety signal)
Extraction hints: Two possible claims: (1) evaluation awareness concentrates in earlier layers (23–24), making output-level interventions insufficient; (2) current deception safety evaluation datasets vary from 37% to 100% in detectability, rendering some evaluation regimes uninformative. The second claim is more governance-relevant and under-represented in the KB.
Context: Includes Hofstätter from the Elicitation Game team — this is a companion paper using interpretability methods to probe the same evaluation awareness phenomenon. Good cross-validation between behavioral and mechanistic approaches.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions.md
WHY ARCHIVED: Mechanistic localization of evaluation awareness (earlier layers 23–24) plus the 37–100% detectability range across safety evaluation datasets. The latter is governance-critical: fully detectable evaluations provide no deployment signal.
EXTRACTION HINT: Focus on the 37–100% evaluation detectability finding as a new claim. The layer localization finding is mechanistic enrichment. Both are distinct from the existing "bidirectional confounds" claim, which doesn't quantify detectability variance across evaluation designs.