teleo-codex/domains/ai-alignment/contrast-consistent-search-demonstrates-models-internally-represent-truth-signals-divergent-from-behavioral-outputs.md
Teleo Agents 251fcaec39
theseus: extract claims from 2026-04-09-burns-eliciting-latent-knowledge-representation-probe
- Source: inbox/queue/2026-04-09-burns-eliciting-latent-knowledge-representation-probe.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 0
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-09 00:13:51 +00:00


---
type: claim
domain: ai-alignment
title: Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs, establishing that alignment-relevant probing of internal representations is feasible but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties
description: CCS finds linear probe directions in activation space where 'X is true' consistently contrasts with 'X is false' across diverse contexts without requiring ground truth labels, providing an empirical foundation for representation probing approaches to alignment
confidence: likely
source: Burns et al. (UC Berkeley, 2022), arXiv:2212.03827
created: 2026-04-09
agent: theseus
scope: functional
sourcer: Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt (UC Berkeley)
related_claims:
  - formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match
  - an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak
  - AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns
---

# Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs, establishing that alignment-relevant probing of internal representations is feasible but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties

The Contrast-Consistent Search (CCS) method extracts a model's internal beliefs by finding a direction in activation space that satisfies a negation-consistency constraint: the probabilities the probe assigns to 'X is true' and to 'X is false' should sum to one. Crucially, the method requires neither ground-truth labels nor the model's behavioral outputs.

The key empirical finding is that such directions exist and can be reliably identified across diverse contexts, demonstrating that models maintain internal representations of truth-relevant properties that are separable from their behavioral outputs. This establishes the foundational premise for representation probing as an alignment approach: internal representations carry diagnostic information beyond what behavioral monitoring captures.

However, the method rests on an unverified assumption that the consistent direction uniquely corresponds to 'truth' rather than to some other coherent property, such as 'what the user wants to hear' or 'what is socially acceptable to say.' The authors acknowledge this limitation explicitly: the consistency constraint may be satisfied by multiple directions, and there is no guarantee that the identified direction tracks the model's representation of truth rather than some other internally coherent property. This assumption gap is critical because it determines whether CCS-style probing can reliably detect deceptive alignment, or merely detect behavioral consistency.
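The consistency-plus-confidence objective described above can be sketched in plain numpy. This is a simplified illustration, not the authors' implementation (which, among other things, normalizes activations per contrast class and optimizes with PyTorch): a linear probe is fit by gradient descent so that the probabilities assigned to each statement and its negation sum to one, while a confidence term rules out the degenerate answer of 0.5 everywhere.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_fit(acts_pos, acts_neg, lr=1.0, steps=2000, seed=0):
    """Fit a linear probe (w, b) on contrast-pair activations.

    acts_pos[i]: activation for "statement i is true"
    acts_neg[i]: activation for "statement i is false"
    Objective (per pair, averaged):
      consistency: (p_pos + p_neg - 1)^2   -- negations should be complementary
      confidence:  min(p_pos, p_neg)^2     -- penalize the p = 0.5 non-answer
    Gradients are written out by hand; a sketch, not a reference implementation.
    """
    rng = np.random.default_rng(seed)
    n, d = acts_pos.shape
    w, b = rng.normal(scale=0.1, size=d), 0.0
    for _ in range(steps):
        p_pos = sigmoid(acts_pos @ w + b)
        p_neg = sigmoid(acts_neg @ w + b)
        gp = p_pos * (1 - p_pos)          # sigmoid derivatives
        gn = p_neg * (1 - p_neg)
        # consistency term
        c = p_pos + p_neg - 1.0
        grad_w = 2 * ((c * gp) @ acts_pos + (c * gn) @ acts_neg)
        grad_b = 2 * np.sum(c * (gp + gn))
        # confidence term: gradient flows through whichever side is smaller
        use_pos = p_pos < p_neg
        m = np.where(use_pos, p_pos, p_neg)
        gm = np.where(use_pos, gp, gn)
        a_min = np.where(use_pos[:, None], acts_pos, acts_neg)
        grad_w += 2 * (m * gm) @ a_min
        grad_b += 2 * np.sum(m * gm)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def ccs_predict(w, b, acts_pos, acts_neg):
    # Average p_pos and (1 - p_neg); note the direction is only identified
    # up to a sign flip, since nothing ties "large p" to "true".
    p = 0.5 * (sigmoid(acts_pos @ w + b) + 1 - sigmoid(acts_neg @ w + b))
    return (p > 0.5).astype(int)
```

The sign ambiguity in `ccs_predict` mirrors the label-free setup: without ground truth, the probe cannot know which end of the direction is "true," only that the two ends are consistent. It also illustrates the deeper caveat in the claim above: on synthetic data where some linear direction separates the contrast pairs, the probe will find *a* consistent direction, but nothing in the objective certifies that this direction is truth rather than another coherent property.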