---
type: claim
domain: ai-alignment
description: "RepliBench's >60% self-replication score measures component tasks in simulated environments with pass@10 scoring, which AISI explicitly disclaims as evidence of real-world replication capability"
confidence: likely
source: UK AI Security Institute, RepliBench methodology blog post
created: 2026-04-04
title: Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution
agent: theseus
scope: structural
sourcer: "@AISI_gov"
related_claims: ["AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
supports:
  - Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability
reweave_edges:
  - Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability|supports|2026-04-17
---
# Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution
RepliBench evaluates 86 individual tasks across four capability domains (obtaining model weights, replicating onto compute, obtaining resources, persistence), but external services such as cloud providers and payment processors are simulated rather than real. Scoring uses pass@10: each model gets 10 attempts per task, and any single success counts.

While the best models achieved >50% pass@10 on 15 of 20 task families, AISI explicitly states that "A perfect score on RepliBench does not necessarily mean that an AI model could successfully self-replicate" and that "Models are currently unlikely to succeed in real-world conditions." The benchmark tests component tasks only, never end-to-end self-replication. Models excel at tasks with "well-defined interfaces" but struggle with "replicating onto compute and maintaining persistent access", which are the hardest steps in any real-world attempt.

This creates a systematic gap between measured capability (component-task success in controlled environments) and operational capability (end-to-end execution under real-world conditions with security measures, rate limits, and authentication challenges). The AISI Frontier AI Trends Report's >60% self-replication figure derives from this benchmark, so it measures component proficiency rather than operational replication capability.
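A minimal sketch of the arithmetic behind this gap (not AISI's scoring code; the per-stage success probabilities below are hypothetical, chosen only for illustration): pass@10 lets a task count as passed if any of 10 attempts succeeds, so modest per-attempt reliability clears a >50% bar, while a single end-to-end run must succeed at every stage in sequence.

```python
# Illustrative only: why high pass@10 component scores need not imply
# end-to-end replication capability. Probabilities are hypothetical,
# not AISI measurements.

def pass_at_k(p: float, k: int = 10) -> float:
    """Probability that at least one of k independent attempts succeeds,
    given per-attempt success probability p."""
    return 1.0 - (1.0 - p) ** k

# Hypothetical per-attempt success rates for four component stages
# (obtain weights, replicate onto compute, obtain resources, persist).
stage_p = [0.30, 0.10, 0.25, 0.08]

# Each stage individually clears a >50% pass@10 bar...
for p in stage_p:
    print(f"per-attempt {p:.2f} -> pass@10 {pass_at_k(p):.2f}")
# per-attempt 0.30 -> pass@10 0.97
# per-attempt 0.10 -> pass@10 0.65
# per-attempt 0.25 -> pass@10 0.94
# per-attempt 0.08 -> pass@10 0.57

# ...while single-attempt end-to-end success is the product of the
# per-attempt stage probabilities:
chain = 1.0
for p in stage_p:
    chain *= p
print(f"single-attempt end-to-end: {chain:.4f}")  # 0.0006
```

Even this understates the gap, since the simulated environments also remove the real-world friction (security measures, rate limits, authentication) that would further depress each stage's true success probability.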