| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | Epoch AI: Do the Biorisk Evaluations of AI Labs Actually Measure the Risk of Developing Bioweapons? | Epoch AI Research (@EpochAIResearch) | https://epoch.ai/gradient-updates/do-the-biorisk-evaluations-of-ai-labs-actually-measure-the-risk-of-developing-bioweapons | 2025-01-01 | ai-alignment | | research-article | unprocessed | high | |
|
Content
A systematic analysis of whether the biorisk evaluations deployed by AI labs actually measure real bioweapon development risk. The paper identifies a structural gap between what benchmarks measure and what operational bioweapon capability requires.
What benchmarks measure:
- Multiple-choice questions on virology knowledge (WMDP, LAB-Bench, ProtocolQA, Cloning Scenarios)
- Textual protocol troubleshooting
- General biological information retrieval
What real bioweapon development requires (not captured by benchmarks):
- Somatic tacit knowledge: hands-on experimental skills ("learning by doing") that text cannot convey or evaluate
- Physical infrastructure: synthetic virus development requires "well-equipped molecular virology laboratories that are expensive to assemble and operate"
- Iterative physical failure recovery: real bioweapon development involves failures that require physical troubleshooting; text-based scenarios cannot simulate this
- Stage coordination: ideation through deployment involves acquisition, synthesis, weaponization steps with physical dependencies
Evaluation quality assessment:
- Strong (most credible): SecureBio's Virology Capabilities Test (VCT) — explicitly targets tacit knowledge with questions unavailable online; expert virologists score ~22% average; frontier models now exceed this
- Weak: WMDP, LAB-Bench — based on published information/textbook questions; "fail to capture practical complexity"
- Methodology opacity problem: most non-public evaluations lack transparency about thresholds and scoring rubrics (e.g., Anthropic's 5x uplift multiplier against a 25% internet-only baseline; the underlying rubric is unpublished)
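To make the opacity point concrete, here is a minimal sketch of what a "5x multiplier against a 25% baseline" check could mean if read naively as success rates. All names and the threshold semantics are assumptions for illustration; Anthropic's actual rubric is unpublished, which is exactly the problem the note flags.

```python
# Hypothetical formalization of an uplift-multiplier threshold test.
# The real rubric is unpublished; function names, the 0.25 baseline,
# and the 5x multiplier semantics here are illustrative assumptions.

def uplift_ratio(ai_assisted_rate: float, baseline_rate: float) -> float:
    """Ratio of AI-assisted success rate to the internet-only baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return ai_assisted_rate / baseline_rate

def crosses_threshold(ai_assisted_rate: float,
                      baseline_rate: float = 0.25,
                      multiplier: float = 5.0) -> bool:
    """True if AI assistance multiplies the baseline by at least `multiplier`."""
    return uplift_ratio(ai_assisted_rate, baseline_rate) >= multiplier

# Note: read naively, crossing 5x over a 0.25 baseline requires an
# AI-assisted rate of at least 1.25, which is impossible for a raw
# success probability. The unpublished rubric must therefore define
# the baseline and metric differently, underscoring the opacity issue.
```

The arithmetic in the closing comment is the substance: without the published rubric, outsiders cannot even determine what quantity the multiplier applies to.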
Benchmark saturation and what it means:
- Frontier models now exceed expert baselines on ProtocolQA and Cloning Scenarios where humans previously outperformed AI
- The authors conclude that the implications of this saturation are "highly ambiguous"
- VCT saturation is more credible grounds for concern, given the benchmark's difficulty (tacit knowledge; answers not available online)
- But: "we remain generally skeptical of assuming uplift from MCQs"
Core conclusion: "existing evaluations do not provide strong evidence that LLMs can enable amateurs to develop bioweapons." High benchmark performance is NOT sufficient evidence for actual bioweapon development capability. Physical bottlenecks make the benchmark-to-real-world translation extremely uncertain.
The governance wrinkle: Anthropic activated ASL-3 for Claude 4 Opus precautionarily — unable to confirm OR rule out threshold crossing — because "clearly ruling out biorisk is not possible with current tools." This is the correct governance response to measurement uncertainty but confirms governance is operating under significant epistemic limitation.
SecureBio 2025-in-review acknowledgment: "It remains an open question how model performance on benchmarks translates to changes in the real-world risk landscape; addressing this uncertainty is a key focus of 2026 efforts."
Agent Notes
Why this matters: The KB claim "AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur, which makes bioterrorism the most proximate AI-enabled existential risk" is grounded in VCT performance (o3 at 43.8% vs. expert average 22.1%). This source provides the strongest systematic analysis of what that comparison actually implies. VCT is the most credible benchmark (tacit knowledge, answers not googleable), so this specific claim has more credibility than MCQ-based claims. But the physical-world gap remains: scoring above a virologist on a text benchmark ≠ completing physical virus synthesis.
What surprised me: Anthropic's precautionary ASL-3 activation for Claude 4 Opus when evaluation couldn't confirm threshold crossing. This is the governance system correctly adapting to measurement uncertainty, but it's remarkable that the most safety-conscious lab activates its strictest protection level to date without being able to confirm it's necessary. This is exactly what governance under systematic measurement uncertainty looks like. It may be the right answer, but it's an expensive, high-friction approach that can't scale.
What I expected but didn't find: Any published evidence that AI actually enabled a real uplift attempt that would fail without AI assistance. All uplift evidence is benchmark-derived; no controlled trial of "can an amateur with AI assistance synthesize [dangerous pathogen] when they couldn't without it" has been published. This gap is itself informative — the physical-world test doesn't exist because it's unethical to run.
KB connections:
- AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur — directly qualifies this claim; VCT credibility confirmed but physical-world translation gap acknowledged
- the gap between theoretical AI capability and observed deployment is massive across all occupations — same pattern in bio: high benchmark performance, unclear real-world translation
- voluntary safety pledges cannot survive competitive pressure — the precautionary ASL-3 activation is voluntary; if the evaluation basis for thresholds is unreliable, what prevents future rollback?
Extraction hints:
- "Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery — making high benchmark scores insufficient evidence for operational bioweapon development capability" — new claim scoping the bio risk benchmark limitations
- "Governance under bio capability uncertainty requires precautionary threshold activation because physical-world translation cannot be benchmarked safely — as Anthropic demonstrated with Claude 4 Opus ASL-3 activation" — connects to governance design
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk — provides scope qualification: this claim holds for text-accessible knowledge stages but not for physical synthesis capability
WHY ARCHIVED: This is the most systematic treatment of the bio benchmark-reality gap; provides the conceptual framework for evaluating what "PhD-level bio capability" actually means for AI
EXTRACTION HINT: Two claims to extract: (1) the scope qualification for bio capability claims (text ≠ physical), (2) the precautionary governance argument (when measurement fails, precautionary activation is the best available response). Confirm the VCT-specific claim about tacit knowledge before extracting — the existing KB claim on bioterrorism risk may need amendment rather than a new competing claim.