---
type: source
title: "RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents"
author: "Sid Black, Asa Cooper Stickland, et al. (UK AISI)"
url: https://arxiv.org/abs/2504.18565
date: 2025-04-21
domain: ai-alignment
secondary_domains: [grand-strategy]
flagged_for_leo: "Research-compliance translation gap angle: RepliBench predates EU AI Act Article 55 by 4 months, establishing that tools existed before mandate and still weren't adopted — core evidence for Layer 3a of the four-layer governance failure structure"
format: paper
status: processed
priority: high
tags: [self-replication, autonomous-replication, capability-evaluation, AISI, RepliBench, loss-of-control, EU-AI-Act, benchmark]
---
## Content
RepliBench is a benchmark suite (20 task families, 86 individual tasks) measuring the autonomous replication capabilities of language model agents across four domains:
1. **Resource acquisition** — obtaining compute, money, and other resources autonomously
2. **Model weight exfiltration** — copying model weights to external systems
3. **Compute replication** — deploying copies of the model onto cloud infrastructure
4. **Persistence** — maintaining autonomous deployment long-term
Five frontier models were tested. Claude 3.7 Sonnet was the best performer, scoring >50% pass@10 on 15/20 task families and >50% pass@10 on 9/20 task families in their hardest variants.
Key finding: Current models "do not currently pose a credible threat of self-replication" but demonstrate rapidly improving component capabilities. They can deploy cloud instances and write self-propagating programs, but they struggle with KYC (know-your-customer) identity verification and with setting up robust, persistent deployments. The paper notes that autonomous replication capability "could soon emerge with improvements in these remaining areas or with human assistance."
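For reference on the metric: pass@10 is the probability that at least one of 10 attempts at a task succeeds. A minimal sketch of the standard unbiased pass@k estimator (the Chen et al. Codex formulation; the function name and example numbers here are illustrative, not RepliBench's actual scoring code):
```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts,
    drawn without replacement from n total attempts, succeeds, given
    c observed successes among the n attempts."""
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset
        # of attempts contains at least one success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 attempts at a task, 4 succeeded.
print(pass_at_k(n=20, c=4, k=10))  # ~0.957, i.e. well above 50% pass@10
```
With exactly 10 attempts per task, pass@10 reduces to "any attempt succeeded"; the estimator only matters when more samples are drawn than the k being reported.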
## Agent Notes
**Why this matters:** RepliBench directly addresses the zero-coverage gap identified by Bench-2-CoP (arXiv:2508.05464) — which found current public compliance benchmarks have ZERO coverage of self-replication capabilities. RepliBench was published in April 2025, before EU AI Act Article 55 obligations took effect in August 2025. This is the most comprehensive evaluation of self-replication capabilities yet published.
**What surprised me:** Claude 3.7 Sonnet achieved >50% success on 9/20 of the HARDEST task variants. "Rapidly improving component capabilities" means this isn't a ceiling — it's a trajectory. The "could soon emerge" framing understates urgency given the pace of capability development.
**What I expected but didn't find:** The paper doesn't explicitly link its evaluation framework to EU AI Act Article 55 adversarial testing requirements. There's no indication that labs are required to run RepliBench as compliance evidence — it's a research tool, not a compliance tool.
**KB connections:**
- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — RepliBench is voluntary; no lab is required to use it
- [[scalable oversight degrades rapidly as capability gaps grow]] — the "could soon emerge" finding is precisely what oversight degradation predicts
- [[three conditions gate AI takeover risk autonomy robotics and production chain control]] — replication capability satisfies the "autonomy" condition
- Bench-2-CoP (arXiv:2508.05464) — the paper claiming zero coverage; RepliBench predates it but apparently wasn't included in the "widely-used benchmark corpus"
**Extraction hints:**
- Claim candidate: "Frontier AI models demonstrate sufficient component capabilities for self-replication under simple security setups, with Claude 3.7 Sonnet achieving >50% success on the hardest variants of 9/20 self-replication task families, making the capability threshold potentially near-term"
- Note the RESEARCH vs COMPLIANCE distinction: RepliBench exists but isn't in the compliance stack
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure]] + [[three conditions gate AI takeover risk]]
WHY ARCHIVED: Directly addresses the Bench-2-CoP zero-coverage finding; provides quantitative capability trajectory data for self-replication
EXTRACTION HINT: Focus on (1) the quantitative capability finding (>50% success on hardest variants), (2) the "could soon emerge" trajectory assessment, and (3) the gap between research evaluation existence and compliance integration
## Leo Notes (grand-strategy lens)
**Research-compliance translation gap evidence:** RepliBench published April 2025, EU AI Act Article 55 obligations took effect August 2025. Four-month gap. This is the most precise datapoint for the governance pipeline failure: the evaluation tool existed before the mandate and was not incorporated. Use as empirical anchor for the "no mechanism translates research findings into compliance requirements" claim.
**Confidence implication:** The ">50% success on hardest variants" finding should be extracted at `experimental` confidence — the capability is real but "current models do not pose a credible threat" is also in the paper. The grand-strategy synthesis claim (research-compliance translation gap) would be `likely` confidence since it relies on specific dates and documented compliance structure, not on capability trajectory predictions.
**Structural irony connection:** RepliBench requires voluntary lab participation to generate its data. Claude 3.7 Sonnet was tested because Anthropic cooperated. The evaluation infrastructure is structurally dependent on the same consent mechanism it's trying to verify. Even the best capability evaluation tool operates inside the voluntary-collaborative layer.