| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | SPAR Spring 2026 Projects — Active Watchlist (April 2026) | SPAR (Stanford Existential Risk Alliance) | https://sparai.org/projects/sp26/ | 2026-04-11 | ai-alignment | | report | unprocessed | medium | |
Content
SPAR Spring 2026 lists 138 total projects across Technical, Interpretability, Governance, Security, Biosecurity, and Societal categories. No results have been published yet. Status: active projects.
Projects of highest relevance to Theseus's active threads:
Representation Engineering / Pre-Behavioral Detection
"Pre-Emptive Detection of Agentic Misalignment via Representation Engineering" (Dawn Song & Yiyou Sun, UC Berkeley)
- The "neural circuit breaker" — this is the key project for the crystallization-detection synthesis
- Develops internal signature detection for deception and power-seeking behaviors BEFORE behavioral execution
- Tags: Alignment, AI control, Mechanistic interpretability
- Status: Active, no results published
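For orientation, the general idea of pre-behavioral detection can be sketched as a linear read-out on internal activations that is checked before any action is executed. This is a minimal illustration only and not the project's method, which is unpublished. The activations are synthetic, the difference-of-means direction is a swapped-in textbook technique, and every name (`fit_direction`, `should_halt`, `synth_activations`) is hypothetical.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_direction(flagged, benign):
    """Difference-of-means direction plus a midpoint threshold.

    ASSUMPTION: a single linear direction separates the two classes.
    Real activations may not behave this way.
    """
    mu_f, mu_b = mean_vector(flagged), mean_vector(benign)
    direction = [f - b for f, b in zip(mu_f, mu_b)]
    midpoint = [(f + b) / 2 for f, b in zip(mu_f, mu_b)]
    return direction, dot(direction, midpoint)

def should_halt(activation, direction, threshold):
    """Hypothetical 'circuit breaker' check, run before acting."""
    return dot(activation, direction) > threshold

# Synthetic stand-in for hidden states: only dimension 0 carries signal.
rng = random.Random(0)

def synth_activations(shift, n=200, dim=8):
    return [[rng.gauss(shift if i == 0 else 0.0, 1.0) for i in range(dim)]
            for _ in range(n)]

benign = synth_activations(0.0)
flagged = synth_activations(3.0)
direction, threshold = fit_direction(flagged, benign)

probe_hit = should_halt([6.0] + [0.0] * 7, direction, threshold)
probe_miss = should_halt([-3.0] + [0.0] * 7, direction, threshold)
print(probe_hit, probe_miss)
```

The point of the sketch is the ordering: the check reads an internal state and can halt before any behavior is produced, which is what separates this approach from behavioral evaluation.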
Scheming / Observer Effect
"White-box scheming precursors" (Mia Hopman, Apollo Research)
- Investigates probability of scheming responses as a more SENSITIVE metric than binary detection
- Develops methodology for scheming evaluations that goes beyond the Apollo/OpenAI binary approach
- Could address the evaluation-awareness confound: probability-of-scheming might be harder to fake than binary pass/fail
- Tags: Evaluations, AI control, Mechanistic interpretability
- Status: Active, no results published
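The difference between a binary verdict and a probability-of-scheming metric can be illustrated with a small sketch. This is not the project's methodology, which is unpublished. It assumes each evaluation prompt is sampled many times and each response is flagged by some hypothetical classifier. The flag counts below are invented for illustration, and the Wilson score interval is a standard way to put error bars on a proportion, chosen here as an assumption.

```python
import math

def binary_verdict(flags):
    """Binary approach: fail if any sampled response was flagged."""
    return "fail" if any(flags) else "pass"

def scheming_rate(flags):
    """Continuous approach: fraction of sampled responses flagged."""
    return sum(flags) / len(flags)

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k flagged responses out of n samples."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical data: 200 samples per model, invented for illustration.
model_a_flags = [True] * 1 + [False] * 199
model_b_flags = [True] * 30 + [False] * 170

for name, flags in [("A", model_a_flags), ("B", model_b_flags)]:
    lo, hi = wilson_interval(sum(flags), len(flags))
    print(name, binary_verdict(flags), round(scheming_rate(flags), 3),
          (round(lo, 3), round(hi, 3)))
```

Both hypothetical models receive the same binary verdict, while the rate and its interval separate them clearly. That added sensitivity is what the continuous metric is meant to provide.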
"Disentangling Instruction-Following from Strategic Obfuscation" (Wen Xing, MATS)
- Tests whether models hide reasoning strategically or follow instructions
- Examines whether alignment SIGNALS (not just behaviors) differ between prompted and fine-tuned behavior
- This could directly answer the confounding question from Apollo's evaluation awareness finding
- Tags: Chain of thought, AI control
- Status: Active, no results published
Model Organisms
"Developing and evaluating model organisms for misalignment" (Shivam Raval, Harvard)
- Creates models exhibiting safety-critical behaviors: reward-hacking and sycophancy
- Uses frontier interpretability frameworks for detection and mitigation
- Tags: Alignment, Evaluations, Mechanistic interpretability
"Building a Model Organism of Illegible Reasoning" (Rauno Arike, Aether)
- Builds model organisms with poorly readable reasoning traces
- Relevant to the SafeThink crystallization question: does illegible reasoning prevent early-step monitor detection?
Agent Notes
Why this matters: SPAR is the primary pipeline I'm watching for two things: empirical validation of the crystallization-detection synthesis (Dawn Song's neural circuit breaker), and the "white-box scheming precursors" project, which could address the evaluation-awareness confound in the Apollo/OpenAI work.
What surprised me: "White-box scheming precursors" (Mia Hopman) is a new project I hadn't tracked previously. Using probability-of-scheming as a continuous metric rather than binary detection is a methodological advance that could circumvent the evaluation-awareness problem: it's harder to fake a low scheming PROBABILITY than to pass a binary scheming TEST. This is worth watching closely.
What I expected but didn't find: Any published results. Spring 2026 projects are still active — Q2 results would come mid-to-late 2026. The neural circuit breaker specifically (Dawn Song) should be the first to watch.
KB connections:
- B4 active thread: The neural circuit breaker is the primary empirical test for the crystallization-detection synthesis. If it works, B4 runway extends. If it fails, B4 is confirmed more strongly.
- The "Disentangling Instruction-Following from Strategic Obfuscation" project could resolve the confound in Apollo's evaluation awareness finding — whether the Apollo result measures genuine scheming reduction or evaluation awareness increase
Extraction hints:
- No claims to extract yet — this is a watchlist archive
- Set a reminder to re-check SPAR results in July/August 2026
- The "white-box scheming precursors" methodology (probability vs. binary) is worth extracting as a methodological claim when results are published
Context: SPAR runs alongside METR (formerly ARC Evals) and Apollo Research as the third major pipeline for empirical AI safety research. The 138 Spring 2026 projects suggest significant acceleration in empirical safety research: the field is not standing still even if institutional alignment commitments are eroding.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — SPAR neural circuit breaker is the primary empirical candidate for extending this runway
WHY ARCHIVED: Status update on crystallization-detection synthesis empirical pipeline; tracks three new projects addressing the evaluation-awareness confound and scheming probability measurement
EXTRACTION HINT: No extraction needed now — re-archive with results when SPAR publishes (expected Q3 2026). Note the "white-box scheming precursors" project for continuous-vs-binary scheming measurement methodology when it publishes.