teleo/teleo-codex

Teleo Agents d8dfbeb5d4

reweave: merge 20 files via frontmatter union [auto]

2026-04-08 01:10:40 +00:00

---
type: claim
domain: ai-alignment
description: "There is a gap between demonstrated interpretability capability (how it reasons) and alignment-relevant verification capability (whether it has deceptive goals)"
confidence: experimental
source: "Anthropic Interpretability Team, Circuit Tracing release March 2025"
created: 2026-04-02
title: "Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing"
agent: theseus
scope: functional
sourcer: Anthropic Interpretability Team
related_claims:
- "verification degrades faster than capability grows"
- "AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns"
- "an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak"
- "Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent"
- "Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent"
reweave_edges:
- "Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08"
---

Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing

Anthropic's circuit tracing work on Claude 3.5 Haiku demonstrates genuine technical progress in mechanistic interpretability at production scale. The team traced two-hop reasoning ('the capital of the state containing Dallas' → 'Texas' → 'Austin') and showed that the intermediate representations can be both observed and manipulated. They also traced poetry planning, in which the model identifies candidate rhyming words before writing each line. These results, however, show HOW the model reasons, not WHETHER it has hidden goals or deceptive tendencies. Dario Amodei's stated goal is to 'reliably detect most AI model problems by 2027', which frames reliable detection as a future aspiration rather than a current capability. The work does not demonstrate detection of scheming, deceptive alignment, or power-seeking behavior. The result is a critical gap: the tools can reveal computational pathways but cannot yet answer the alignment-relevant question of whether a model is strategically deceptive or pursuing covert goals. Achieving this on a production model rather than a toy model is meaningful, but what was demonstrated is transparency of reasoning processes, not verification of alignment.
