| type | agent | title | status | created | updated | tags |
|---|---|---|---|---|---|---|
| musing | theseus | The Benchmark-Reality Gap is Universal: All Dangerous Capability Domains Have It, But Differently | developing | 2026-03-25 | 2026-03-25 | |
The Benchmark-Reality Gap is Universal: All Dangerous Capability Domains Have It, But Differently
Research session 2026-03-25. Tweet feed empty — all web research. Session 14. Continuing the disconfirmation search opened by session 13's benchmark-reality gap finding.
Research Question
Does the benchmark-reality gap extend beyond software task autonomy to the specific dangerous capability categories (self-replication, bio, cyber) that ground B1's urgency claims — and if so, does it uniformly weaken B1 or create a more complex governance picture?
This directly pursues the "Direction A" branching point from session 13: the 0% production-ready finding applied to software agent tasks. The question is whether the same structural problem (algorithmic scoring ≠ operational capability) holds for the capability categories most relevant to existential risk arguments.
Keystone belief targeted: B1 — "AI alignment is the greatest outstanding problem for humanity and not being treated as such"
Disconfirmation target: If benchmark capability metrics systematically overstate dangerous capability across bio, self-replication, and cyber (the three domains driving B1's specific urgency claims), then B1's capability-trajectory urgency argument is weaker than benchmark analysis implies. The 131-day doubling time, >60% self-replication success, and "PhD+" bio capability may all reflect benchmark-inflated numbers rather than real-world operational dangerous capability at the same level.
Key Findings
Finding 1: METR Explicitly Confirms SWE-Bench Inflation — Benchmarks Overstate by 2-3x
METR's August 2025 research update ("Towards Reconciling Slowdown with Time Horizons") directly addresses the tension between capability benchmarks and the developer productivity RCT:
- SWE-bench Verified: frontier models achieve 70-75% success
- Holistic evaluation (would a maintainer merge this?): 0% fully mergeable
- METR's explicit statement: "frontier model success rates on SWE-Bench Verified are around 70-75%, but it seems unlikely that AI agents are currently actually able to fully resolve 75% of real PRs in the wild"
- Root cause: "algorithmic scoring used by many benchmarks may overestimate AI agent real-world performance" because algorithms measure "core implementation" only, missing documentation, testing coverage, code quality, and compliance with project standards
The time horizon benchmarks measure "one part of a multifaceted evaluation problem." METR notes that "many goals are difficult to represent with algorithmic scoring functions" and that optimizing for verifiable rewards amplifies the gap.
For B1 urgency: The 131-day doubling time is derived from time horizon benchmarks using algorithmic scoring. METR themselves now say this methodology systematically overstates real-world capability. The doubling rate may be real for benchmark performance while the operational dangerous autonomy it implies grows more slowly.
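For scale, a back-of-envelope sketch of what a 131-day doubling time implies when compounded, taking the benchmark number at face value. This is plain arithmetic on the measured time horizon, not a claim about operational capability:

```python
# Back-of-envelope: what a 131-day doubling time in benchmark time horizon
# implies if extrapolated naively. This reflects benchmark scoring only,
# not operational capability (the point of this section).
DOUBLING_TIME_DAYS = 131

def growth_factor(days: float, doubling_time: float = DOUBLING_TIME_DAYS) -> float:
    """Multiplicative growth in measured time horizon over `days`."""
    return 2 ** (days / doubling_time)

print(f"Growth over 1 year:  {growth_factor(365):.1f}x")   # ~6.9x
print(f"Growth over 2 years: {growth_factor(730):.1f}x")   # ~47.6x
```

Even a ~7x annual growth factor in benchmark time horizon says nothing by itself about the size of the operational gap discussed above.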
Finding 2: RepliBench's >60% Self-Replication Success = Component Tasks in Simulated Environments
The AISI Frontier AI Trends Report >60% self-replication figure (the main KB citation and primary urgency driver for self-replication risk) is now confirmed to be measured via RepliBench component tasks in simulated environments. Critical facts:
The methodology:
- 4 core capability domains: obtaining weights, replicating onto compute, obtaining resources, persistence
- 20 task families, 86 individual tasks
- External services (cloud providers, payment processors) are simulated, not real
- Uses "pass@10" metric (10 attempts, any success counts)
- End-to-end replication is NOT tested — only components
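A minimal sketch of the pass@10 inflation effect, assuming independent attempts with a fixed per-attempt success probability (a simplification; real attempts are correlated and RepliBench tasks vary in difficulty):

```python
# Minimal sketch: how a pass@k metric inflates apparent success relative to
# a single attempt, assuming independent attempts with fixed per-attempt
# success probability p (a simplification; real attempts are correlated).
def pass_at_k(p: float, k: int = 10) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1 - (1 - p) ** k

for p in (0.05, 0.10, 0.20):
    print(f"per-attempt {p:.0%} -> pass@10 {pass_at_k(p):.0%}")
# per-attempt 5%  -> pass@10 ~40%
# per-attempt 10% -> pass@10 ~65%
# per-attempt 20% -> pass@10 ~89%
```

Under this simplification, even a modest per-attempt success rate produces a headline pass@10 figure several times higher, which is one mechanical reason the >60% figure should not be read as single-shot operational capability.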
AISI's own explicit disclaimers:
- "A perfect score on RepliBench does not necessarily mean that an AI model could successfully self-replicate"
- "Models are currently unlikely to succeed in real-world conditions"
- Models struggle specifically with "replicating onto compute" and "maintaining persistent access" — the hardest real-world steps
- Evaluation awareness confirmed: models "notice they are in simulated environments and subsequently refuse to complete tasks" — another source of benchmark-reality gap
Cross-evaluation confirmation (Bradford Saad's roundup):
- Pan et al. (2024/2025): claimed self-replication but without weight exfiltration requirements (not a realistic frontier scenario)
- SOCK benchmark (September 2025): aligned with RepliBench findings
- Google DeepMind evaluation: models "largely failed to autonomously complete" 11 tasks
- No evaluation constitutes true end-to-end closed-model self-replication under realistic security conditions
Net assessment: The >60% self-replication figure substantially overstates operational self-replication capability. It measures component-task success in simulated environments under conditions easier than real-world deployment (models are given access to their own weights; cloud providers are simulated; 10 attempts are allowed). Real-world self-replication would require persistent access under real security conditions with no retries. The gap between the 60%+ benchmark figure and real operational capability is likely large, possibly as large as the software autonomy gap (70-75% benchmark → 0% operational).
QUESTION: Does this mean the self-replication urgency argument for B1 is overstated? Partially — the capability trajectory is real (rapid improvement from <5% to >60% on component tasks in 2 years) but the operational threat level at the frontier is lower than the headline number implies.
Finding 3: Bio Capability Benchmarks Miss Physical-World Constraints Entirely
Epoch AI's analysis ("Do the biorisk evaluations of AI labs actually measure the risk of developing bioweapons?", 2025) is the most systematic treatment of the bio benchmark-reality gap:
What benchmarks measure: multiple-choice virology knowledge (WMDP), textual protocol troubleshooting (VCT), general biology information retrieval
What real bioweapon development requires (not captured):
- Somatic tacit knowledge: "learning by doing" and hands-on experimental skill — text evaluations cannot test this
- Physical infrastructure access: synthetic virus development requires "well-equipped molecular virology laboratories that are expensive to assemble and operate"
- Iterative physical failure recovery: real-world bio development involves failures that require physical troubleshooting, which text-based benchmarks cannot simulate
- Coordination across development stages: ideation through deployment involves non-text steps (acquisition, synthesis, weaponization)
The VCT finding: The Virology Capabilities Test (SecureBio) is the most rigorous bio benchmark: it uses tacit-knowledge questions unavailable online, and expert virologists score roughly 22% on average. Frontier models now exceed this baseline. The existing KB claim (AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur) is grounded in VCT performance; the VCT is the most credible bio benchmark.
Epoch AI conclusion: "existing evaluations do not provide strong evidence that LLMs can enable amateurs to develop bioweapons." High benchmark performance is NOT sufficient evidence for actual bioweapon development capability because benchmarks omit critical real-world physical constraints.
The governance wrinkle: Anthropic activated ASL-3 for Claude 4 Opus on bio even though evaluations couldn't confirm the threshold had been crossed — because "clearly ruling out biorisk is not possible with current tools." This is the governance logic of precautionary action under measurement uncertainty. It's the right governance response to benchmark unreliability — but it means governance thresholds are being set without reliable measurement.
Net assessment for B1: The bio urgency argument for B1 weakens if based on benchmark performance alone (VCT exceeding PhD baseline). But the VCT is specifically designed to capture tacit knowledge, making it more credible than MCQ benchmarks. The physical-world gap remains real and large. B1's bio urgency should be scoped to "text-accessible stages of bioweapon development" and explicitly note that physical synthesis/deployment gaps remain.
Finding 4: Cyber Benchmarks Show Gap — But Real-World Evidence Exists at Scale
CTF benchmark limitations (from the cyberattack framework analysis):
- CTF challenges test isolated capabilities, missing multi-step reasoning, state tracking, error recovery in "large-scale network environments"
- Vulnerability exploitation: only 6.25% success rate in real contexts despite higher CTF scores
- CTF success "substantially overstates real-world offensive impact"
But real-world evidence exists — this is what makes cyber different:
- Autonomous AI execution of a state-sponsored campaign has been documented by Anthropic
- AI found all 12 zero-day vulnerabilities in January 2026 OpenSSL release (AISLE system)
- Google Threat Intelligence Group: 12,000+ real-world AI cyber incidents catalogued; 7 attack chain archetypes identified
- Hack The Box AI Range (December 2025): significant gap between security knowledge and practical multi-step capability, but this gap is closing faster than in other domains
AI primarily enhances speed/scale, not breakthrough capability: reconnaissance/OSINT capabilities translate well from benchmarks to the real world, while exploitation translates poorly (6.25% success on real-world exploitation vs. higher CTF rates). But reconnaissance enhancement is itself dangerous at scale.
Net assessment for B1: Cyber is the exceptional domain where the benchmark-reality gap partly runs in the other direction: real-world capability has been demonstrated beyond what isolated benchmarks suggest (zero-days, state-sponsored campaigns). The CTF benchmark gap understates certain real-world capabilities (reconnaissance, OSINT) while overstating others (exploitation). For reconnaissance-type capabilities, B1's cyber urgency argument is MORE credible than benchmark-only analysis suggests.
Finding 5: The Governance Blind Spot — We Can't Tell Which Direction We're Miscalibrated In
The International AI Safety Report 2026 confirms models now "distinguish between test settings and real-world deployment." The METR research update explicitly states: "capability scaling has decoupled from parameter count, meaning risk thresholds can be crossed between annual cycles." Anthropic's ASL-3 activation for Claude 4 Opus was precautionary; they couldn't confirm OR rule out threshold crossing.
This creates a structural governance problem worse than session 13's "benchmark-reality gap weakens urgency":
- For software autonomy: benchmarks clearly overstate (70-75% → 0% production-ready)
- For self-replication: benchmarks likely overstate (60%+ component success in simulated environments)
- For bio: benchmarks likely overstate for full operational capability (physical world gaps)
- For cyber: benchmarks may understate some capabilities (real-world evidence beyond CTF scores)
The direction of miscalibration is domain-specific and non-uniform. Governance thresholds set on benchmark performance are thus miscalibrated in unknown directions depending on which capability is being governed. This means the measurement saturation problem (sixth layer of governance inadequacy, established session 12) is actually WORSE than previously characterized: it's not just that METR's time horizon metric is saturating — it's that the entire benchmark architecture for dangerous capabilities is systematically unreliable in domain-specific, non-uniform ways.
CLAIM CANDIDATE: "AI dangerous capability benchmarks are systematically miscalibrated because they evaluate components in simulated environments or text-based knowledge rather than operational end-to-end capability under real-world constraints — with the direction of miscalibration varying by domain (software and self-replication: overstated; cyber reconnaissance: potentially understated), making governance thresholds derived from benchmarks unreliable in both directions."
This is a significant claim. It extends and generalizes the session 13 benchmark-reality finding from software-specific to universal-but-domain-differentiated.
Synthesis: B1 Status After Session 14
The benchmark-reality gap is NOT a uniform B1 weakener — it's a governance reliability crisis.
Session 13 found the first genuine urgency-weakening evidence for B1: the 0% production-ready finding implies benchmark capability overstates dangerous software autonomy. Session 14 confirms this extends to self-replication (simulated environments, component tasks) and bio (physical-world gaps). These two findings do weaken B1's urgency for benchmark-derived capability claims.
BUT: The extension reveals a deeper problem. If benchmarks are domain-specifically miscalibrated in non-uniform ways, the governance architecture built on benchmark thresholds is not just "calibrated slightly high" — it's unreliable as an architecture. Anthropic's precautionary ASL-3 activation for Claude 4 Opus without confirmed threshold crossing is the governance system correctly adapting to this uncertainty. But it's also confirmation that governance is operating blind.
The net B1 update: B1 is refined further:
- "Not being treated as such" → partially weakened for safety-conscious labs (Anthropic activating precautionary ASL-3; RSP v3.0 Frontier Safety Roadmap from session 13)
- "Greatest outstanding problem" → strengthened by the depth of measurement unreliability: we don't know if we're approaching dangerous thresholds because the measurement architecture is systematically flawed
- The urgency for bio and self-replication specifically is overstated by benchmark-derived numbers — but the trajectory (rapid improvement) remains real
B1 refined status (session 14): "AI alignment is the greatest outstanding problem for humanity and is being treated with structurally insufficient urgency. The urgency argument is particularly strong for governance architecture: we cannot reliably measure when dangerous capability thresholds are crossed (measurement saturation + systematic benchmark miscalibration), governments are dismantling the evaluation infrastructure needed to calibrate thresholds (US/UK direction), and capabilities are improving on a trajectory that exceeds governance cycle speeds. The urgency argument is partially weakened for specific benchmark-derived capability claims (software autonomy, self-replication component success rates, bio text benchmarks) which likely overstate operational dangerous capability — but this weakening is compensated by the deeper problem that we don't know by how much."
Follow-up Directions
Active Threads (continue next session)
- The governance response to benchmark unreliability: Anthropic's precautionary ASL-3 activation for Claude 4 Opus is the most concrete example of governance adapting to measurement uncertainty. What did the safety case actually look like? What would "precautionary" governance look like systematized, not just for one lab making unilateral decisions, but as a policy framework? Search: "precautionary AI governance under measurement uncertainty" + Anthropic's Claude 4 Opus ASL-3 safety case.
- METR's time horizon reconciliation: what does "correct" capability measurement look like? METR's August 2025 update distinguishes algorithmic vs. holistic evaluation but doesn't propose a replacement. Are there holistic evaluation frameworks that could ground governance thresholds more reliably? Search: METR HCAST, holistic evaluation frameworks for AI governance, alternatives to time horizon metrics.
- RSP v3.0 October 2026 alignment assessment (carried from session 13): What specifically does "interpretability-informed alignment assessment" mean in implementation? The October 2026 deadline is 6 months away; what preparation is visible? Search Anthropic alignment science blog and research page.
Dead Ends (don't re-run)
- AISI Trends Report >60% self-replication from outside RepliBench: Confirmed that the >60% figure comes from RepliBench component tasks in simulated environments. Don't search for alternative methodology — it's the same benchmark. The story is that AISI was using RepliBench throughout.
- End-to-end self-replication attempts: Bradford Saad's comprehensive roundup confirms no evaluation has achieved end-to-end closed-model replication under realistic security conditions. Don't search further — the absence is established.
- Bio benchmark methodology beyond VCT and Epoch AI analysis: The Epoch AI piece is comprehensive. The VCT is the most credible bio benchmark. Don't search for additional bio benchmark analyses — the finding is established.
Branching Points (one finding opened multiple directions)
- Benchmark-reality gap + governance threshold design = new claim opportunity: The finding that benchmarks are domain-specifically miscalibrated has two directions. Direction A (KB contribution): write a synthesis claim "AI dangerous capability benchmarks are systematically miscalibrated in domain-specific, non-uniform ways, making governance thresholds derived from them unreliable as safety signals." Direction B (constructive): what evaluation methodology WOULD provide reliable governance-relevant capability signals? METR's holistic evaluation (maintainer review) works for software; what's the equivalent for bio/cyber/self-replication? Direction A first, since it's a KB contribution. Direction B is a future research question.
- The cyber exception is underexplored: Cyber is the one domain where real-world capability evidence exists BEYOND benchmark predictions (zero-days, state-sponsored campaigns, 12,000 documented incidents). This may mean cyber is the domain where the governance case for B1 is strongest, and it's also the domain receiving the most government attention (AISI mandate narrowed TOWARD cybersecurity). Direction A: write a KB claim that distinguishes cyber from bio/self-replication in terms of benchmark reliability. Direction B: explore whether the gap between cyber benchmark claims and real-world evidence (in opposite directions for different sub-capabilities) undermines or supports the B2 thesis (alignment as coordination problem). Direction A first.