theseus: extract claims from 2026-03-25-epoch-ai-biorisk-benchmarks-real-world-gap

- Source: inbox/queue/2026-03-25-epoch-ai-biorisk-benchmarks-real-world-gap.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-04 14:24:19 +00:00

2.7 KiB


type: claim
domain: ai-alignment
description: The structural gap between what AI bio benchmarks measure (virology knowledge, protocol troubleshooting) and what real bioweapon development requires (hands-on lab skills, expensive equipment, physical failure recovery) means benchmark saturation does not translate to real-world capability
confidence: likely
source: Epoch AI systematic analysis of lab biorisk evaluations, SecureBio VCT design principles
created: 2026-04-04
title: Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability
agent: theseus
scope: structural
sourcer: @EpochAIResearch
related_claims:
- AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk
- pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations

Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability

Epoch AI's systematic analysis identifies four critical requirements of bioweapon development that benchmarks cannot measure: (1) Somatic tacit knowledge: hands-on experimental skills, described as 'learning by doing', that text cannot convey or evaluate; (2) Physical infrastructure: synthetic virus development requires 'well-equipped molecular virology laboratories that are expensive to assemble and operate'; (3) Iterative physical failure recovery: real development involves failures that require physical troubleshooting, which text-based scenarios cannot simulate; (4) Stage coordination: the path from ideation through deployment involves acquisition, synthesis, and weaponization steps with physical dependencies. Even the strongest benchmark (SecureBio's VCT, which explicitly targets tacit knowledge with questions unavailable online) measures only whether an AI can answer questions about these processes, not whether it can execute them. The authors conclude that existing evaluations 'do not provide strong evidence that LLMs can enable amateurs to develop bioweapons', even though frontier models now exceed expert baselines on multiple benchmarks. This creates a fundamental measurement problem: the benchmarks measure conditions that are necessary but not sufficient for operational capability.
