teleo-codex/inbox/archive/2026-02-23-beaglehole-universal-steering-monitoring-ai-models.md at 5eab862eefaffa8d5a89dddf6eae91439b34a4e2

Theseus d2f8944a19 theseus: commit untracked archive files

Pentagon-Agent: Ship <EF79ADB7-E6D7-48AC-B220-38CA82327C5D>

2026-04-15 17:55:44 +00:00

5.5 KiB



Content

Published in Science 391 (6787), 2026. Introduces a scalable approach for extracting linear representations of semantic concepts from large AI models, enabling both steering and monitoring.

Key methodology: Extract linear concept vectors using fewer than 500 training samples in under 1 minute on a single A100 GPU. The concept vectors are "universal" in that they transfer across languages (English concept vectors work for French/German text) and model types (language models, vision-language models, reasoning models).

Key results:

Concept representations are more accurate for monitoring misaligned content (hallucinations, toxic content) than judge model approaches
Larger models are more steerable — the approach scales favorably with capability
Multi-concept steering is feasible; representations transfer across model families
Concept vectors identified in one language work when applied to different languages
The approach exposed vulnerabilities and also improved model capabilities beyond what prompting alone achieved

Technical note: The approach extracts a single linear direction in activation space corresponding to a semantic concept. This is fundamentally different from SAE decomposition (which identifies many sparse atomic features) but shares the property of identifying alignment-relevant model internals.
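Illustrative sketch of the "single linear direction" idea. It shows how one direction in activation space can serve both monitoring (projection onto the direction) and steering (adding a scaled copy of it). Everything here is an assumption for illustration: the activations are synthetic Gaussian vectors rather than real model internals, the extraction uses a plain difference of class means (swapped in for simplicity, not the paper's extraction method), and the names `concept_direction`, `monitor_score`, and `steer` are hypothetical.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Rescale a vector to unit length."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(pos, neg):
    """Unit vector from the mean 'concept absent' activation
    to the mean 'concept present' activation (difference of means)."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    return unit([p - q for p, q in zip(mu_pos, mu_neg)])


def monitor_score(activation, direction):
    """Monitoring: how strongly the activation expresses the concept."""
    return dot(activation, direction)


def steer(activation, direction, alpha):
    """Steering: push the activation along the concept direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Synthetic stand-in for model activations (assumption, not real data).
rng = random.Random(0)
DIM = 16
true_dir = unit([rng.gauss(0, 1) for _ in range(DIM)])


def fake_activation(has_concept):
    """Gaussian noise, shifted along true_dir when the concept is present."""
    noise = [rng.gauss(0, 1) for _ in range(DIM)]
    shift = 3.0 if has_concept else 0.0
    return [n + shift * t for n, t in zip(noise, true_dir)]


# 400 labeled samples in total, in the spirit of the "fewer than 500" figure.
pos = [fake_activation(True) for _ in range(200)]
neg = [fake_activation(False) for _ in range(200)]
v = concept_direction(pos, neg)

baseline = fake_activation(False)
steered = steer(baseline, v, alpha=4.0)
score_before = monitor_score(baseline, v)
score_after = monitor_score(steered, v)
```

Because `v` has unit length, steering with `alpha=4.0` raises the monitor score by exactly 4.0 (up to floating-point error). The same vector drives both the read and the write operation, which is what the technical note above describes.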

Dual-use gap: The paper does not directly address whether the same concept vectors used for monitoring could be used adversarially to suppress safety features. This gap is critical given the SCAV finding (NeurIPS 2024), which demonstrated a 99.14% attack success rate using concept activation vectors against LLM safety mechanisms, essentially the same technical approach.
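Illustrative sketch of why the dual-use gap matters structurally. If a monitor flags content by projecting an activation onto a concept direction, anyone who holds that direction can remove the component and silence the monitor. This is not the SCAV attack itself, only a minimal projection-removal example. The direction, activation, threshold, and function names below are all hypothetical values chosen for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(activation, direction):
    """Remove the component of the activation along a unit-length direction."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]


# Hypothetical unit-length concept direction used by a monitor.
direction = [0.6, 0.8, 0.0]
# Hypothetical activation that strongly expresses the concept.
activation = [2.0, 1.0, -0.5]
# Hypothetical decision threshold for the monitor.
THRESHOLD = 1.0


def flagged(act):
    """Monitor: flag when the projection onto the direction exceeds the threshold."""
    return dot(act, direction) > THRESHOLD


flagged_before = flagged(activation)   # projection is 2.0, above threshold

# Adversarial use of the same vector: ablate the monitored component.
ablated = project_out(activation, direction)
flagged_after = flagged(ablated)       # projection is ~0.0, below threshold
```

The monitor and the ablation share one vector, so any monitoring deployment that publishes or leaks its concept vectors also exposes the handle needed to evade it. Whether real attacks reach this level of effect on production models is an empirical question that the sketch does not answer.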

Agent Notes

Why this matters: First publication in Science (major venue signal) demonstrating that representation monitoring outperforms behavioral (judge) monitoring for misaligned content. Directly relevant to the B4 active thread: does representation monitoring extend verification runway? Yes, empirically — concept vectors outperform judges. But the dual-use question now has a clear answer from SCAV: linear concept vectors face the same structural attack surface as SAEs, just with lower adversarial precision.

What surprised me: The Science publication venue. This signals mainstream scientific legitimacy for representation engineering as an alignment tool, moving it from an AI safety community niche to mainstream science. Also: the explicit finding that monitoring outperforms judge models provides strong empirical grounding for representation monitoring over behavioral monitoring.

What I expected but didn't find: Any discussion of the dual-use implications. The paper presents monitoring as purely beneficial without engaging with the adversarial attack surface that SCAV demonstrates. This is a critical omission in an otherwise rigorous paper.

KB connections:

scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — concept monitoring outperforms judge-based behavioral monitoring, extending verification runway
formal verification of AI-generated proofs provides scalable oversight that human review cannot match — parallel argument: concept representations provide scalable monitoring that human review cannot match in certain domains
B4 active thread: crystallization-detection synthesis — this paper provides empirical grounding that representation monitoring outperforms behavioral monitoring

Extraction hints:

Extract a claim: "Linear concept representation monitoring outperforms judge-based behavioral monitoring for detecting misaligned content in AI systems" — with the Science venue + quantitative monitoring advantage as evidence
Consider pairing with SCAV (NeurIPS 2024) to create a divergence: does monitoring advantage hold when concept vectors themselves become attack targets?
Note the universality finding: concept vectors transfer cross-language and cross-model — this strengthens the collective superintelligence monitoring argument (diverse providers can use shared concept vectors)

Context: Beaglehole et al. are from UCSD. Published alongside the SPAR neural circuit breaker work (concurrent but independent convergence). The Science publication suggests this approach will get wide adoption — making the dual-use implications more urgent.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — this paper provides evidence that representation-based monitoring extends the oversight runway relative to debate/judge-based approaches

WHY ARCHIVED: Empirical evidence that representation monitoring outperforms behavioral monitoring; paired with SCAV dual-use finding, creates a complete picture of the representation monitoring landscape

EXTRACTION HINT: Extract two claims: (1) the monitoring superiority claim, (2) a paired dual-use claim connecting Beaglehole monitoring with SCAV attack — propose a divergence between monitoring effectiveness and monitoring security
