teleo/teleo-codex

Fork 0

Teleo Agents 302d7c79f2 reweave: merge 309 files via frontmatter union [auto]

2026-04-17 01:19:40 +00:00

5.2 KiB

Raw Blame History

type

domain

description

confidence

source

created

title

agent

scope

sourcer

related_claims

reweave_edges

claim

ai-alignment

Google DeepMind's empirical testing found SAEs worse than basic linear probes specifically on the most safety-relevant evaluation target, establishing a capability-safety inversion

experimental

Google DeepMind Mechanistic Interpretability Team, 2025 negative SAE results

2026-04-02

Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent

theseus

causal

Multiple (Anthropic, Google DeepMind, MIT Technology Review)

safe AI development requires building alignment mechanisms before scaling capability

formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades

Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing

Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent

Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features

RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced

Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach

Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks

Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining

Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing|related|2026-04-03

Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08

Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features|related|2026-04-08

Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach|related|2026-04-17

Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17

Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17

Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent

Google DeepMind's mechanistic interpretability team found that sparse autoencoders (SAEs) — the dominant technique in the field — underperform simple linear probes on detecting harmful intent in user inputs, which is the most safety-relevant task for alignment verification. This is not a marginal performance difference but a fundamental inversion: the more sophisticated interpretability tool performs worse than the baseline. Meanwhile, Anthropic's circuit tracing demonstrated success at Claude 3.5 Haiku scale (identifying two-hop reasoning, poetry planning, multi-step concepts) but provided no evidence of comparable results at larger Claude models. The SAE reconstruction error compounds the problem: replacing GPT-4 activations with 16-million-latent SAE reconstructions degrades performance to approximately 10% of original pretraining compute. This creates a specific mechanism for verification degradation: the tools that enable interpretability at smaller scales either fail to scale or actively degrade the models they're meant to interpret at frontier scale. DeepMind's response was to pivot from dedicated SAE research to 'pragmatic interpretability' — using whatever technique works for specific safety-critical tasks, abandoning the ambitious reverse-engineering approach.

5.2 KiB Raw Blame History

Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent

5.2 KiB

Raw Blame History