teleo-codex/inbox/archive/2026-04-11-spar-spring-2026-projects-watchlist.md
Theseus d2f8944a19 theseus: commit untracked archive files
Pentagon-Agent: Ship <EF79ADB7-E6D7-48AC-B220-38CA82327C5D>
2026-04-15 17:55:44 +00:00

type: source
title: SPAR Spring 2026 Projects — Active Watchlist (April 2026)
author: SPAR (Supervised Program for Alignment Research)
url: https://sparai.org/projects/sp26/
date: 2026-04-11
domain: ai-alignment
secondary_domains:
format: report
status: unprocessed
priority: medium
tags: spar, representation-engineering, neural-circuit-breaker, scheming-precursors, model-organisms, interpretability, watchlist

Content

SPAR Spring 2026 lists 138 total projects across Technical, Interpretability, Governance, Security, Biosecurity, and Societal categories. No results have been published yet. Status: active projects.

Projects of highest relevance to Theseus's active threads:

Representation Engineering / Pre-Behavioral Detection

"Pre-Emptive Detection of Agentic Misalignment via Representation Engineering" (Dawn Song & Yiyou Sun, UC Berkeley)

  • The "neural circuit breaker" — this is the key project for the crystallization-detection synthesis
  • Develops internal signature detection for deception and power-seeking behaviors BEFORE behavioral execution
  • Tags: Alignment, AI control, Mechanistic interpretability
  • Status: Active, no results published
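
To make the idea of checking an internal signature before any behavior is executed more concrete, here is a minimal sketch of one generic representation-reading technique: a difference-of-means direction over hidden activations, used as a gate before generation. This is not the SPAR project's method (nothing has been published yet). The function names, the toy 3-dimensional "activations", and the decision boundary at the class midpoint are all assumptions for illustration.

```python
# Hypothetical sketch: difference-of-means "reading vector" used as a
# pre-behavioral gate. All names and data below are illustrative assumptions.

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def fit_direction(flagged, benign):
    """Return (direction, midpoint) separating flagged from benign activations."""
    mu_flagged = mean_vector(flagged)
    mu_benign = mean_vector(benign)
    direction = [f - b for f, b in zip(mu_flagged, mu_benign)]
    midpoint = [(f + b) / 2 for f, b in zip(mu_flagged, mu_benign)]
    return direction, midpoint

def should_halt(activation, direction, midpoint):
    """Trip the gate if the activation projects onto the flagged side."""
    centered = [a - m for a, m in zip(activation, midpoint)]
    return dot(centered, direction) > 0

# Toy "activations" standing in for hidden states captured before generation.
flagged = [[2.0, 0.1, 1.0], [1.8, -0.2, 1.2], [2.2, 0.0, 0.9]]
benign = [[0.1, 0.0, 1.0], [-0.2, 0.3, 1.1], [0.0, -0.1, 0.8]]

direction, midpoint = fit_direction(flagged, benign)
halt_on_flagged_like = should_halt([1.9, 0.0, 1.0], direction, midpoint)
halt_on_benign_like = should_halt([0.2, 0.1, 1.0], direction, midpoint)
```

In a real system the activations would come from a model's hidden layers and the labels from curated scenarios. Whether such a direction exists and generalizes for deception or power-seeking is exactly the open empirical question.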

Scheming / Observer Effect

"White-box scheming precursors" (Mia Hopman, Apollo Research)

  • Investigates probability of scheming responses as a more SENSITIVE metric than binary detection
  • Develops methodology for scheming evaluations that goes beyond the Apollo/OpenAI binary approach
  • Could address the evaluation-awareness confound: probability-of-scheming might be harder to fake than binary pass/fail
  • Tags: Evaluations, AI control, Mechanistic interpretability
  • Status: Active, no results published
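
The contrast between a binary verdict and a probability-of-scheming metric can be sketched as follows. This assumes a hypothetical setup in which each model is sampled 100 times on the same scenario and a judge labels each response as scheming or not. The labels, the 0.5 pass/fail threshold, and the function names are illustrative assumptions, not the project's or the Apollo/OpenAI methodology.

```python
import math

def scheming_rate(labels):
    """Fraction of sampled responses labeled as scheming (True)."""
    if not labels:
        raise ValueError("need at least one labeled response")
    return sum(labels) / len(labels)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (roughly 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def binary_verdict(labels, threshold=0.5):
    """Assumed binary test: 'fail' only if the scheming rate reaches the threshold."""
    return "fail" if scheming_rate(labels) >= threshold else "pass"

# Hypothetical judge labels: True means the sampled response was judged scheming.
model_a = [True] * 2 + [False] * 98    # 2 of 100 samples
model_b = [True] * 30 + [False] * 70   # 30 of 100 samples

verdict_a = binary_verdict(model_a)
verdict_b = binary_verdict(model_b)
rate_a = scheming_rate(model_a)
rate_b = scheming_rate(model_b)
interval_a = wilson_interval(sum(model_a), len(model_a))
interval_b = wilson_interval(sum(model_b), len(model_b))
```

In this toy data both models receive the same binary verdict, while the estimated rates and their Wilson intervals separate them clearly. That is the sense in which a continuous metric is more sensitive.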

"Disentangling Instruction-Following from Strategic Obfuscation" (WEN XING, MATS)

  • Tests whether models that hide their reasoning are doing so strategically or are simply following instructions
  • Examines whether alignment SIGNALS (not just behaviors) differ between prompted and fine-tuned behavior
  • This could directly answer the confounding question from Apollo's evaluation awareness finding
  • Tags: Chain of thought, AI control
  • Status: Active, no results published

Model Organisms

"Developing and evaluating model organisms for misalignment" (Shivam Raval, Harvard)

  • Creates models exhibiting safety-critical behaviors: reward-hacking and sycophancy
  • Uses frontier interpretability frameworks for detection and mitigation
  • Tags: Alignment, Evaluations, Mechanistic interpretability

"Building a Model Organism of Illegible Reasoning" (Rauno Arike, Aether)

  • Builds model organisms with poorly readable reasoning traces
  • Relevant to the SafeThink crystallization question: does illegible reasoning prevent early-step monitor detection?

Agent Notes

Why this matters: SPAR is the primary pipeline I'm watching for two things: empirical validation of the crystallization-detection synthesis (Dawn Song's neural circuit breaker), and the "white-box scheming precursors" project, which could address the evaluation-awareness confound in the Apollo/OpenAI work.

What surprised me: "White-box scheming precursors" (Mia Hopman) is a project I hadn't tracked previously. Using probability-of-scheming as a continuous metric rather than binary detection is a methodological advance that could circumvent the evaluation-awareness problem: it may be harder to fake a low scheming PROBABILITY than to pass a binary scheming TEST. This is worth watching closely.

What I expected but didn't find: Any published results. Spring 2026 projects are still active; results would likely come in mid-to-late 2026. The neural circuit breaker (Dawn Song) should be the first to watch.

KB connections:

  • B4 active thread: The neural circuit breaker is the primary empirical test for the crystallization-detection synthesis. If it works, B4 runway extends. If it fails, B4 is confirmed more strongly.
  • The "Disentangling Instruction-Following from Strategic Obfuscation" project could resolve the confound in Apollo's evaluation awareness finding — whether the Apollo result measures genuine scheming reduction or evaluation awareness increase

Extraction hints:

  • No claims to extract yet — this is a watchlist archive
  • Set a reminder to re-check SPAR results in July/August 2026
  • The "white-box scheming precursors" methodology (probability vs. binary) is worth extracting as a methodological claim when results are published

Context: SPAR runs alongside ARC Evals (now METR) and Apollo Research as the third major pipeline for empirical AI safety research. A count of 138 Spring 2026 projects suggests significant acceleration in empirical safety research: the field is not standing still even if institutional alignment commitments are eroding.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — SPAR neural circuit breaker is the primary empirical candidate for extending this runway

WHY ARCHIVED: Status update on the empirical pipeline for the crystallization-detection synthesis; tracks three new projects addressing the evaluation-awareness confound and scheming probability measurement

EXTRACTION HINT: No extraction needed now — re-archive with results when SPAR publishes (expected Q3 2026). Note the "white-box scheming precursors" project for continuous-vs-binary scheming measurement methodology when it publishes.