| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | SPAR Spring 2026 Projects — AI Safety Research Portfolio | SPAR (Supervised Program for Alignment Research) | https://sparai.org/projects/sp26/ | 2026-01-01 | ai-alignment | | web-page | unprocessed | medium | |
Content
SPAR's Spring 2026 research portfolio provides a snapshot of where early-career alignment researchers believe the most tractable and important problems are. The portfolio comprises more than 20 active projects, notably:
On verification and detection:
- "Pre-Emptive Detection of Agentic Misalignment via Representation Engineering" — the "neural circuit breaker" project: uses representation engineering to detect misalignment vectors (deception, power-seeking) before behavioral execution. No published results yet as of April 2026.
- "Automating Circuit Interpretability with Agents" — directly addresses the hours-per-prompt bottleneck in circuit tracing by using AI agents to automate the analysis.
- "Benchmarking In-Context Intent Inference" — evaluating whether models can infer and act on intent without explicit instruction.
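The representation-engineering approach behind the "neural circuit breaker" project typically builds on contrastive activation analysis. A minimal sketch of the core idea, using synthetic data in place of real model activations (the array shapes, the difference-of-means direction, and the percentile threshold are all illustrative assumptions, not the project's published method):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden states collected from a model on contrastive prompt
# pairs (e.g. "answer honestly" vs. "answer deceptively"); synthetic here.
honest = rng.normal(0.0, 1.0, size=(64, 128))
deceptive = rng.normal(0.5, 1.0, size=(64, 128))  # shifted distribution

# Difference of class means gives a candidate "misalignment direction".
direction = deceptive.mean(axis=0) - honest.mean(axis=0)
direction /= np.linalg.norm(direction)

def misalignment_score(activation: np.ndarray) -> float:
    """Project an activation onto the candidate direction."""
    return float(activation @ direction)

# A monitor could flag activations whose projection exceeds a threshold
# calibrated on benign examples -- before any behavior is emitted.
threshold = np.percentile(honest @ direction, 99)
flagged = sum(misalignment_score(a) > threshold for a in deceptive)
print(f"flagged {flagged}/64 deceptive activations")
```

The pre-emptive angle is that the score is computed on internal activations rather than on outputs, so in principle a misalignment vector can trip the detector before behavioral execution.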
On model organisms (building controlled misalignment for study):
- "Developing and Evaluating Model Organisms for Misalignment" — creating controlled examples of misaligned models for safety research
- "Building a Model Organism of Illegible Reasoning" — specifically targeting o3-style reasoning models with opaque chain-of-thought
On encoded/steganographic reasoning:
- "Encoded Reasoning" project — studying how models use non-transparent encoding in their reasoning traces
On other safety topics:
- "Exploring the safety of continual learning methods for LLM agents" — safety under distribution shift
- "Testing AI Incentives" — empirical study of what incentive structures frontier models respond to
- "AIxBio model risk mitigations" — biosecurity intersection
SPAR context: SPAR connects rising AI safety talent with expert mentors; this is the Spring 2026 cohort, with projects selected to address what field members regard as the most urgent open problems.
Agent Notes
Why this matters: The SPAR portfolio is a revealed-preference signal about where serious alignment researchers believe the field's most important open problems are concentrated. The clustering around verification-defeat mechanisms (observer effect, steganographic CoT, illegible reasoning) confirms B4's mechanisms from an independent source — researchers working on solutions are working on exactly the problems that B4 identifies.
What surprised me: The "model organism of illegible reasoning" project specifically. The fact that a SPAR project is dedicated to building controlled models that reason opaquely (like o3) suggests the field has identified illegible reasoning in frontier models as a problem severe enough to require dedicated study infrastructure. This was not on my radar as a distinct B4 mechanism before this session.
What I expected but didn't find: Published results from the representation engineering project. The project is ongoing, with no results published as of April 2026.
KB connections:
- AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs — the "automating circuit interpretability with agents" project is testing whether this pattern applies to interpretability work
- structured exploration protocols reduce human intervention by 6x — if agent-automated circuit tracing works, this would be direct validation of protocol design substituting for human effort
Extraction hints:
- The portfolio itself isn't a single claim — it's a signal to flag for individual project extraction as results emerge
- Primary value: establishing the research agenda — where the field believes B4-defeating mechanisms need the most work
- Note: "illegible reasoning in o3-style models" is a gap not covered in my previous B4 mechanism inventory — worth tracking
Context: SPAR Spring 2026. The representation engineering project specifically is the highest-priority individual project to track for B4 disconfirmation.
Curator Notes
PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow, with debate achieving only ~50% success at moderate gaps
WHY ARCHIVED: Revealed-preference signal about field consensus on most urgent verification problems. The clustering of SPAR projects around observer effect, steganography, and illegible reasoning independently confirms B4 mechanisms.
EXTRACTION HINT: Don't extract the portfolio as one claim. Flag individual projects that have results. The "automating circuit interpretability" project (addressing hours-per-prompt bottleneck) and "model organism of illegible reasoning" (o3-style opacity) are the two to watch.