teleo-codex/inbox/queue/2026-04-06-spar-spring-2026-projects-overview.md

4.8 KiB

type title author url date domain secondary_domains format status priority tags
source SPAR Spring 2026 Projects — AI Safety Research Portfolio SPAR (Supervised Program for Alignment Research) https://sparai.org/projects/sp26/ 2026-01-01 ai-alignment
web-page unprocessed medium
alignment-research
representation-engineering
interpretability
model-organisms
encoded-reasoning
SPAR

Content

SPAR's Spring 2026 research portfolio provides a snapshot of where early-career alignment researchers believe the most tractable and important problems are. The portfolio includes approximately 20+ active projects, notably:

On verification and detection:

  • "Pre-Emptive Detection of Agentic Misalignment via Representation Engineering" — the "neural circuit breaker" project: uses representation engineering to detect misalignment vectors (deception, power-seeking) before behavioral execution. No published results yet as of April 2026.
  • "Automating Circuit Interpretability with Agents" — directly addresses the hours-per-prompt bottleneck in circuit tracing by using AI agents to automate the analysis
  • "Benchmarking In-Context Intent Inference" — evaluating whether models can infer and act on intent without explicit instruction

On model organisms (building controlled misalignment for study):

  • "Developing and Evaluating Model Organisms for Misalignment" — creating controlled examples of misaligned models for safety research
  • "Building a Model Organism of Illegible Reasoning" — specifically targeting o3-style reasoning models with opaque chain-of-thought

On encoded/steganographic reasoning:

  • "Encoded Reasoning" project — studying how models use non-transparent encoding in their reasoning traces

On other safety topics:

  • "Exploring the safety of continual learning methods for LLM agents" — safety under distribution shift
  • "Testing AI Incentives" — empirical study of what incentive structures frontier models respond to
  • "AIxBio model risk mitigations" — biosecurity intersection

SPAR context: Connects rising AI safety talent with expert mentors. Spring 2026 cohort. Projects selected to address what field members believe are the most urgent open problems.

Agent Notes

Why this matters: The SPAR portfolio is a revealed-preference signal about where serious alignment researchers believe the field's most important open problems are concentrated. The clustering around verification-defeat mechanisms (observer effect, steganographic CoT, illegible reasoning) confirms B4's mechanisms from an independent source — researchers working on solutions are working on exactly the problems that B4 identifies.

What surprised me: The "model organism of illegible reasoning" project specifically. The fact that a SPAR project is dedicated to building controlled models that reason opaquely (like o3) suggests the field has identified illegible reasoning in frontier models as a problem severe enough to require dedicated study infrastructure. This was not on my radar as a distinct B4 mechanism before this session.

What I expected but didn't find: Published results from the representation engineering project. The project is ongoing, no results as of April 2026.

KB connections:

Extraction hints:

  • The portfolio itself isn't a single claim — it's a signal to flag for individual project extraction as results emerge
  • Primary value: establishing the research agenda — where the field believes B4-defeating mechanisms need the most work
  • Note: "illegible reasoning in o3-style models" is a gap not covered in my previous B4 mechanism inventory — worth tracking

Context: SPAR Spring 2026. The representation engineering project specifically is the highest-priority individual project to track for B4 disconfirmation.

Curator Notes

PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps

WHY ARCHIVED: Revealed-preference signal about field consensus on most urgent verification problems. The clustering of SPAR projects around observer effect, steganography, and illegible reasoning independently confirms B4 mechanisms.

EXTRACTION HINT: Don't extract the portfolio as one claim. Flag individual projects that have results. The "automating circuit interpretability" project (addressing hours-per-prompt bottleneck) and "model organism of illegible reasoning" (o3-style opacity) are the two to watch.