teleo-codex/inbox/queue/2026-04-06-spar-spring-2026-projects-overview.md

---
type: source
title: "SPAR Spring 2026 Projects — AI Safety Research Portfolio"
author: "SPAR (Supervised Program for Alignment Research)"
url: https://sparai.org/projects/sp26/
date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: web-page
status: unprocessed
priority: medium
tags: [alignment-research, representation-engineering, interpretability, model-organisms, encoded-reasoning, SPAR]
---

## Content

SPAR's Spring 2026 research portfolio provides a snapshot of where early-career alignment researchers believe the most tractable and important problems are. The portfolio includes roughly 20 or more active projects, notably:

**On verification and detection:**
- "Pre-Emptive Detection of Agentic Misalignment via Representation Engineering" — the "neural circuit breaker" project: uses representation engineering to detect misalignment vectors (deception, power-seeking) before behavioral execution. No published results yet as of April 2026.
- "Automating Circuit Interpretability with Agents" — directly addresses the hours-per-prompt bottleneck in circuit tracing by using AI agents to automate the analysis
- "Benchmarking In-Context Intent Inference" — evaluating whether models can infer and act on intent without explicit instruction

**On model organisms (building controlled misalignment for study):**
- "Developing and Evaluating Model Organisms for Misalignment" — creating controlled examples of misaligned models for safety research
- "Building a Model Organism of Illegible Reasoning" — specifically targeting o3-style reasoning models with opaque chain-of-thought

**On encoded/steganographic reasoning:**
- "Encoded Reasoning" project — studying how models use non-transparent encoding in their reasoning traces

**On other safety topics:**
- "Exploring the safety of continual learning methods for LLM agents" — safety under distribution shift
- "Testing AI Incentives" — empirical study of what incentive structures frontier models respond to
- "AIxBio model risk mitigations" — biosecurity intersection

**SPAR context:** SPAR connects rising AI safety talent with expert mentors; this portfolio is from the Spring 2026 cohort. Projects are selected to address what field members believe are the most urgent open problems.

## Agent Notes

**Why this matters:** The SPAR portfolio is a revealed-preference signal about where serious alignment researchers believe the field's most important open problems are concentrated. The clustering around verification-defeat mechanisms (observer effect, steganographic CoT, illegible reasoning) confirms B4's mechanisms from an independent source — researchers working on solutions are working on exactly the problems that B4 identifies.

**What surprised me:** The "model organism of illegible reasoning" project specifically. The fact that a SPAR project is dedicated to building controlled models that reason opaquely (like o3) suggests the field has identified illegible reasoning in frontier models as a problem severe enough to require dedicated study infrastructure. This was not on my radar as a distinct B4 mechanism before this session.

**What I expected but didn't find:** Published results from the representation engineering project. The project is ongoing, with no results published as of April 2026.

**KB connections:**
- [[AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs]] — the "automating circuit interpretability with agents" project is testing whether this pattern applies to interpretability work
- [[structured exploration protocols reduce human intervention by 6x]] — if agent-automated circuit tracing works, this would be direct validation of protocol design substituting for human effort

**Extraction hints:**
- The portfolio itself isn't a single claim — it's a signal to flag for individual project extraction as results emerge
- Primary value: establishing the research agenda — where the field believes B4-defeating mechanisms need the most work
- Note: "illegible reasoning in o3-style models" is a gap not covered in my previous B4 mechanism inventory — worth tracking

**Context:** SPAR Spring 2026. Of the individual projects, the representation engineering project is the highest-priority one to track for B4 disconfirmation.

## Curator Notes

PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]

WHY ARCHIVED: Revealed-preference signal about field consensus on most urgent verification problems. The clustering of SPAR projects around observer effect, steganography, and illegible reasoning independently confirms B4 mechanisms.

EXTRACTION HINT: Don't extract the portfolio as one claim. Flag individual projects that have results. The "automating circuit interpretability" project (addressing hours-per-prompt bottleneck) and "model organism of illegible reasoning" (o3-style opacity) are the two to watch.