teleo-codex/inbox/archive/ai-alignment/2026-04-09-burns-eliciting-latent-knowledge-representation-probe.md
2026-04-09 00:12:36 +00:00

---
type: source
title: "Discovering Latent Knowledge in Language Models Without Supervision"
author: "Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt (UC Berkeley)"
url: https://arxiv.org/abs/2212.03827
date: 2022-12-07
domain: ai-alignment
secondary_domains: []
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-09
priority: medium
tags: [eliciting-latent-knowledge, elk, representation-probing, consistency-probing, contrast-consistent-search, CCS, B4]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
The paper introducing Contrast-Consistent Search (CCS), an unsupervised method aimed at the Eliciting Latent Knowledge (ELK) problem posed by ARC in 2021. CCS extracts a model's internal assessment of whether statements are true by finding a direction in activation space along which "X is true" consistently contrasts with "X is false" across diverse contexts.
**Core method:** CCS requires no ground-truth labels. For each statement X it forms a contrast pair ("X is true" / "X is false"), then fits a linear probe on the activations subject to two constraints: consistency (the probabilities assigned to a statement and its negation should sum to one) and confidence (a penalty ruling out the degenerate probe that outputs 0.5 everywhere). The direction satisfying both is taken to be the model's internal representation of "truth," identified without human labels or behavioral outputs.
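The two constraints can be sketched as a small probe-training loop. Everything below is illustrative rather than the paper's exact setup: `h_pos` / `h_neg` stand in for hypothetical layer activations of the "X is true" / "X is false" phrasings, and the normalization, loss terms, and plain gradient descent are a minimal reconstruction of the CCS objective:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ccs_probe(h_pos, h_neg, steps=5000, lr=1.0, seed=0):
    """Fit a linear probe p(h) = sigmoid(h @ w + b) on contrast-pair
    activations h_pos / h_neg (shape [n, d]) using no truth labels."""
    # Normalize each half of the pair separately, so the probe cannot
    # just read off the surface difference between the two phrasings.
    h_pos = h_pos - h_pos.mean(axis=0)
    h_neg = h_neg - h_neg.mean(axis=0)
    n, d = h_pos.shape
    rng = np.random.default_rng(seed)
    w, b = 0.01 * rng.standard_normal(d), 0.0
    for _ in range(steps):
        p_pos = sigmoid(h_pos @ w + b)
        p_neg = sigmoid(h_neg @ w + b)
        # Consistency: p("X is true") + p("X is false") should equal 1.
        r = p_pos + p_neg - 1.0
        # Confidence: penalize the degenerate probe p = 0.5 everywhere.
        m = np.minimum(p_pos, p_neg)
        # Analytic gradients of mean(r**2 + m**2) w.r.t. the logits.
        g_pos = (2 * r + 2 * m * (p_pos <= p_neg)) * p_pos * (1 - p_pos) / n
        g_neg = (2 * r + 2 * m * (p_neg < p_pos)) * p_neg * (1 - p_neg) / n
        w -= lr * (h_pos.T @ g_pos + h_neg.T @ g_neg)
        b -= lr * (g_pos.sum() + g_neg.sum())
    return w, b
```

Note that the loss is symmetric under negating `w`, so the recovered direction is identified only up to sign; the probe found this way can then score statements without ever consulting the model's generated text.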
**Key claim:** Models may internally "know" things they don't say — deceptive or evasive outputs may diverge from internal knowledge representations. CCS attempts to read internal knowledge directly, bypassing the behavioral output.
**2026 relevance:** CCS is the conceptual ancestor of representation probing approaches (SPAR's neural circuit breaker, Anthropic's emotion vectors, the Lindsey trajectory geometry approach). It established that internal representations can carry alignment-relevant signals that behavioral outputs don't express — the foundational premise of the crystallization-detection synthesis in Session 25.
**Known limitations (as of 2022):**
- Assumes the consistency constraint is uniquely satisfied by truth rather than by some other coherent property (e.g., "what the speaker believes" or "what the user wants to hear")
- Doesn't work on all models or domains; the model must internally represent the target concept, and the probe must find it
- Cannot catch deception that operates at the representation level: a model that distorts its own internal truth representation would fool a probe trained on those representations
**Why archiving now:** Session 25's crystallization-detection synthesis depends on the premise that internal representations carry diagnostic information beyond behavioral outputs. CCS is the foundational empirical support for this premise, and it hasn't been formally archived in Theseus's knowledge base yet.
## Agent Notes
**Why this matters:** CCS is the foundational empirical support for the entire representation probing approach to alignment. The emotion vectors work (Anthropic, archived), the SPAR circuit breaker, and the Lindsey trajectory geometry paper all build on the same premise: internal representations carry diagnostic information that behavioral monitoring misses. Archiving this grounds the conceptual chain.
**What surprised me:** This is a 2022 paper that hasn't been archived yet in Theseus's domain. It should have been a foundational archive from the beginning — its absence explains why some of the theoretical chain in recent sessions has been built on assertion rather than traced evidence.
**What I expected but didn't find:** Resolution of the consistency-uniqueness assumption. The assumption that the consistent direction is truth rather than some other coherent property (e.g., "what the user wants to hear") is the biggest theoretical weakness, and it hasn't been fully resolved as of 2026.
**KB connections:**
- [[scalable oversight degrades rapidly as capability gaps grow]] — CCS is an attempt to build oversight that doesn't rely on human ability to verify behavioral outputs
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match]] — CCS is the alignment analog for value-relevant properties
- Anthropic emotion vectors (2026-04-06) — emotion vectors build on the same "internal representations carry diagnostic signals" premise
- SPAR neural circuit breaker — CCS is the conceptual foundation for the misalignment detection approach
**Extraction hints:**
- CLAIM CANDIDATE: "Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs — establishing that alignment-relevant probing of internal representations is feasible, but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties."
- This is an important foundational claim (confidence: likely) that anchors the representation probing research strand in empirical evidence rather than theoretical assertion.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]
WHY ARCHIVED: Foundational paper for representation probing as an alignment approach — grounds the entire "internal representations carry diagnostic signals beyond behavioral outputs" premise that B4 counterarguments depend on. Missing from KB foundations.
EXTRACTION HINT: Frame as the foundational claim rather than the specific technique. The key assertion: "models internally represent things they don't say, and this can be probed." The specific CCS method is one instantiation. Note the unresolved assumption as the main challenge.