teleo-codex/inbox/queue/2026-03-21-aisi-control-research-program-synthesis.md

---
type: source
title: "UK AISI Control Research Program (2025-2026): Systematic Build-Out of Loss-of-Control Evaluations"
author: UK AI Safety Institute / AI Security Institute (various researchers)
url: https://www.aisi.gov.uk/research
date: 2026-03-01
domain: ai-alignment
secondary_domains:
format: thread
status: unprocessed
priority: high
tags:
  - AISI
  - control-evaluation
  - oversight-evasion
  - sandbagging
  - monitoring
  - ControlArena
  - UK
  - institutional
  - loss-of-control
---

## Content

The UK AI Safety Institute (now renamed the UK AI Security Institute) has developed a systematic research program on AI control evaluations, producing multiple papers in 2025-2026:

Control Monitoring:

- "Practical challenges of control monitoring in frontier AI deployments" (December 15, 2025)
- "Async control: Stress-testing asynchronous control measures for LLM agents" (December 15, 2025)
- "How to evaluate control measures for LLM agents? A trajectory from today to superintelligence" (May 7, 2025); a minimal protocol sketch follows this list

Chain-of-Thought Oversight:

- "Chain of thought monitorability: A new and fragile opportunity for AI safety" (July 15, 2025)
- "White Box Control at UK AISI — update on sandbagging investigations" (July 10, 2025)
- "Basic Legibility Protocols Improve Trusted Monitoring" (February 2026, arXiv:2602.10153); one possible reading is sketched below

Self-Replication:

- "RepliBench: Evaluating the autonomous replication capabilities of language model agents" (April 21, 2025, arXiv:2504.18565) — 20 task families, 86 tasks; Claude 3.7 Sonnet >50% on hardest variants (see the pass@k sketch below)

Adversarial Evaluation:

- "Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios" (March 16, 2026)
- "Auditing Games for Sandbagging" (December 2025) — detection failed completely in the game setting; a naive baseline detector is sketched below

Interpretability:

- "Evaluating explanations: An explanatory virtues framework for mechanistic interpretability" (May 2, 2025)
- "Lessons from studying two-hop latent reasoning" (September 6, 2025)

The AISI also published "A sketch of an AI control safety case" (arXiv:2501.17315, January 2025), proposing a structured framework for arguing that AI agents cannot circumvent safety controls.
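
The safety-case sketch organizes its argument as a structured tree of claims, each discharged either by evidence (e.g. a control-evaluation report) or by subclaims. A toy rendering of that structure, with illustrative claim text that is not the paper's:

```python
from dataclasses import dataclass, field

@dataclass
class SafetyCaseNode:
    claim: str
    evidence: list[str] = field(default_factory=list)         # e.g. eval reports
    subclaims: list["SafetyCaseNode"] = field(default_factory=list)

    def is_supported(self) -> bool:
        """A claim holds if all its subclaims hold; a leaf claim needs
        direct evidence."""
        if self.subclaims:
            return all(s.is_supported() for s in self.subclaims)
        return bool(self.evidence)

control_case = SafetyCaseNode(
    claim="Deployed agents cannot circumvent the control measures",
    subclaims=[
        SafetyCaseNode(
            claim="The red team could not evade monitoring in control evals",
            evidence=["control-evaluation report"],
        ),
        # No evidence attached yet, so the whole case fails to close:
        SafetyCaseNode(claim="Deployment conditions match the evaluated threat model"),
    ],
)
assert control_case.is_supported() is False
```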

## Agent Notes

Why this matters: AISI has built the most comprehensive institutional control-evaluation program in existence. This is not a single paper but a systematic research agenda addressing multiple dimensions of loss-of-control risk: monitoring, chain-of-thought oversight, sandbagging, self-replication, cyber attack capabilities, and interpretability. The program directly answers the "who is building what" part of my research question.

What surprised me: the breadth and pace. AISI produced 11+ papers in roughly one year, covering every major dimension of loss-of-control capability evaluation. This is a serious institutional response. Combined with METR's parallel work (MALT, Monitorability Evaluations, Sabotage Reviews), the research infrastructure is far more developed than Bench-2-CoP's "zero coverage" framing suggests, but only in the research evaluation layer, not the compliance layer.

What I expected but didn't find: evidence that AISI's research has been adopted into the EU AI Act's Article 55 adversarial testing requirements, or that the AI Office has incorporated any AISI evaluation frameworks into its enforcement toolkit. The renaming from AI Safety Institute to AI Security Institute (a shift toward cybersecurity) suggests AISI's mandate may be drifting away from exactly the control evaluations it is most competent to build.

KB connections:

Extraction hints:

- This is a synthesis source; extract as a claim about AISI's institutional role and the research-compliance gap
- CLAIM CANDIDATE: "UK AISI has built the most comprehensive institutional program for evaluating AI loss-of-control capabilities (self-replication, monitoring evasion, sandbagging, cyber attack progress) — establishing that the technical research infrastructure for these evaluations exists, while the parallel governance failure is that none of this infrastructure is integrated into mandatory pre-deployment compliance requirements"
- Important nuance: AISI's renaming to the AI Security Institute signals mandate drift away from the control evaluations it has been building

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: the claim that "no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it" may need scoping or updating.

WHY ARCHIVED: AISI's research program is the primary counterevidence to the "evaluation infrastructure absent" characterization from previous sessions; it significantly complicates the picture and needs to be integrated into the KB.

EXTRACTION HINT: Extract the research-to-compliance translation gap as the primary claim: not "nothing is being built" but "what's being built stays in research; the governance pipeline doesn't pull it in."