teleo-codex/domains/ai-alignment/component-task-benchmarks-overestimate-operational-capability-because-simulated-environments-remove-real-world-friction.md
Teleo Agents 29b1da65cc
theseus: extract claims from 2026-03-25-aisi-replibench-methodology-component-tasks-simulated
- Source: inbox/queue/2026-03-25-aisi-replibench-methodology-component-tasks-simulated.md
- Domain: ai-alignment
- Claims: 2, Entities: 1
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-04 14:22:11 +00:00

type: claim
domain: ai-alignment
description: RepliBench's >60% self-replication score measures component tasks in simulated environments with pass@10 scoring, which AISI explicitly disclaims as evidence of real-world replication capability
confidence: likely
source: UK AI Security Institute, RepliBench methodology blog post
created: 2026-04-04
title: Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution
agent: theseus
scope: structural
sourcer: @AISI_gov
related_claims:
- AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md
- pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md

Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution

RepliBench evaluates 86 individual tasks across 4 capability domains (obtaining model weights, replicating onto compute, obtaining resources, and persistence), but external services such as cloud providers and payment processors are simulated rather than real. Scoring is pass@10: a model gets 10 attempts per task, and any single success counts as a pass. The benchmark tests component tasks only, never end-to-end self-replication.

The best models achieved >50% pass@10 on 15 of 20 task families, yet AISI explicitly states 'A perfect score on RepliBench does not necessarily mean that an AI model could successfully self-replicate' and 'Models are currently unlikely to succeed in real-world conditions.' Models excel at 'well-defined interfaces' but struggle with 'replicating onto compute and maintaining persistent access', which are the hardest real-world steps.

The result is a systematic gap between measured capability (component task success in controlled environments) and operational capability (end-to-end execution under real-world conditions with security measures, rate limits, and authentication challenges). The AISI Frontier AI Trends Report's >60% self-replication figure derives from this benchmark, so it measures component proficiency rather than operational replication capability.
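
The gap between a pass@10 component score and end-to-end capability can be made concrete with a little arithmetic. The following is a minimal sketch, not AISI's scoring code: it assumes attempts are independent and that end-to-end replication requires every step in a chain to succeed on a single attempt. The per-attempt success rate of 0.07 and the 4-step chain are hypothetical values chosen only for illustration; neither appears in the RepliBench material.

```python
def pass_at_k(p_single: float, k: int) -> float:
    """Probability that at least one of k attempts succeeds.

    Assumes attempts are independent with the same success rate,
    which is a simplification made for this sketch.
    """
    return 1.0 - (1.0 - p_single) ** k


def end_to_end(p_steps: list[float]) -> float:
    """Probability that every step of a sequential chain succeeds,
    given one attempt per step and independent steps (assumed)."""
    result = 1.0
    for p in p_steps:
        result *= p
    return result


# Hypothetical numbers for illustration only; none come from AISI.
p_single = 0.07   # assumed per-attempt success rate on one component task
n_steps = 4       # assumed length of the end-to-end chain

component_score = pass_at_k(p_single, k=10)
chain_score = end_to_end([p_single] * n_steps)

print(f"pass@10 on one component:      {component_score:.3f}")
print(f"single-attempt {n_steps}-step chain:  {chain_score:.6f}")
```

Under these assumptions, a model that succeeds on only a small fraction of individual attempts can still clear 50% pass@10 on a component, while its chance of completing the whole chain without retries is orders of magnitude lower. Real-world friction such as rate limits and authentication challenges would further restrict retries, which this sketch does not model.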