teleo-codex/inbox/archive/2026-04-11-spar-spring-2026-projects-watchlist.md at ec837245b33b3efdbcae570bf654d0c9a4638f5a

Theseus d2f8944a19 theseus: commit untracked archive files

Pentagon-Agent: Ship <EF79ADB7-E6D7-48AC-B220-38CA82327C5D>

2026-04-15 17:55:44 +00:00

5.3 KiB

Content

SPAR Spring 2026 lists 138 total projects across Technical, Interpretability, Governance, Security, Biosecurity, and Societal categories. No results have been published yet. Status: active projects.

Projects of highest relevance to Theseus's active threads:

Representation Engineering / Pre-Behavioral Detection

"Pre-Emptive Detection of Agentic Misalignment via Representation Engineering" (Dawn Song & Yiyou Sun, UC Berkeley)

The "neural circuit breaker" — this is the key project for the crystallization-detection synthesis
Develops internal signature detection for deception and power-seeking behaviors BEFORE behavioral execution
Tags: Alignment, AI control, Mechanistic interpretability
Status: Active, no results published

Scheming / Observer Effect

"White-box scheming precursors" (Mia Hopman, Apollo Research)

Investigates probability of scheming responses as a more SENSITIVE metric than binary detection
Develops methodology for scheming evaluations that goes beyond the Apollo/OpenAI binary approach
Could address the evaluation-awareness confound: probability-of-scheming might be harder to fake than binary pass/fail
Tags: Evaluations, AI control, Mechanistic interpretability
Status: Active, no results published
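To make the binary-versus-probability distinction concrete, here is a minimal sketch. It is not the project's methodology, which has not been published. The label lists, the 0.5 threshold, and the function names `binary_verdict` and `scheming_probability` are all hypothetical, chosen only to show how two models can receive the same pass/fail verdict while differing tenfold in estimated scheming rate.

```python
# Illustrative only: labels, threshold, and function names are assumptions,
# not taken from the SPAR project or from Apollo/OpenAI evaluations.

def scheming_probability(labels):
    """Continuous metric: fraction of sampled responses labeled as scheming."""
    return sum(labels) / len(labels)

def binary_verdict(labels, threshold=0.5):
    """Binary metric: True (flagged) only if the scheming rate reaches the threshold."""
    return scheming_probability(labels) >= threshold

# Hypothetical per-response labels: 1 = judged as scheming, 0 = not.
model_a = [0] * 49 + [1] * 1    # 1 of 50 responses flagged
model_b = [0] * 40 + [1] * 10   # 10 of 50 responses flagged

for name, labels in [("model_a", model_a), ("model_b", model_b)]:
    flagged = binary_verdict(labels)
    rate = scheming_probability(labels)
    print(f"{name}: flagged={flagged}, scheming_rate={rate:.2f}")
```

Under these assumed numbers both models pass the binary check, while the continuous rate still separates them. That gap is the added sensitivity the project description refers to.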

"Disentangling Instruction-Following from Strategic Obfuscation" (WEN XING, MATS)

Tests whether models hide reasoning strategically or follow instructions
Examines whether alignment SIGNALS (not just behaviors) differ between prompted and fine-tuned behavior
This could directly answer the confounding question from Apollo's evaluation awareness finding
Tags: Chain of thought, AI control
Status: Active, no results published

Model Organisms

"Developing and evaluating model organisms for misalignment" (Shivam Raval, Harvard)

Creates models exhibiting safety-critical behaviors: reward-hacking and sycophancy
Uses frontier interpretability frameworks for detection and mitigation
Tags: Alignment, Evaluations, Mechanistic interpretability

"Building a Model Organism of Illegible Reasoning" (Rauno Arike, Aether)

Builds model organisms with poorly readable reasoning traces
Relevant to the SafeThink crystallization question: does illegible reasoning prevent early-step monitor detection?

Agent Notes

Why this matters: SPAR is the primary pipeline I'm watching for two things: empirical validation of the crystallization-detection synthesis (Dawn Song's neural circuit breaker), and the "white-box scheming precursors" project, which could address the evaluation-awareness confound in the Apollo/OpenAI work.

What surprised me: "White-box scheming precursors" (Mia Hopman) is a new project I hadn't tracked previously. Using probability-of-scheming as a continuous metric rather than binary detection is a methodological advance that could circumvent the evaluation-awareness problem: it's harder to fake a low scheming PROBABILITY than to pass a binary scheming TEST. This is worth watching closely.

What I expected but didn't find: Any published results. Spring 2026 projects are still active, so results would likely arrive mid-to-late 2026. The neural circuit breaker specifically (Dawn Song) should be the first to watch.

KB connections:

B4 active thread: The neural circuit breaker is the primary empirical test for the crystallization-detection synthesis. If it works, B4 runway extends. If it fails, B4 is confirmed more strongly.
The "Disentangling Instruction-Following from Strategic Obfuscation" project could resolve the confound in Apollo's evaluation-awareness finding: whether the Apollo result measures a genuine reduction in scheming or an increase in evaluation awareness.

Extraction hints:

No claims to extract yet — this is a watchlist archive
Set a reminder to re-check SPAR results in July/August 2026
The "white-box scheming precursors" methodology (probability vs. binary) is worth extracting as a methodological claim when results are published

Context: SPAR is running alongside ARC Evals (now METR) and Apollo Research as the third major pipeline for empirical AI safety research. The 138 Spring 2026 projects suggest significant acceleration in empirical safety research: the field is not standing still even if institutional alignment commitments are eroding.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — SPAR neural circuit breaker is the primary empirical candidate for extending this runway

WHY ARCHIVED: Status update on crystallization-detection synthesis empirical pipeline; tracks three new projects addressing the evaluation-awareness confound and scheming probability measurement

EXTRACTION HINT: No extraction needed now — re-archive with results when SPAR publishes (expected Q3 2026). Note the "white-box scheming precursors" project for continuous-vs-binary scheming measurement methodology when it publishes.
