pipeline: archive 1 source(s) post-merge
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
parent 78181f5212
commit 47a678f972
1 changed file with 64 additions and 0 deletions
@@ -0,0 +1,64 @@
---
type: source
title: "AI Self-Replication Roundup — No End-to-End Evaluation Exists Under Realistic Security Conditions"
author: "Bradford Saad (@BradfordSaad)"
url: https://meditationsondigitalminds.substack.com/p/ai-self-replication-roundup
date: 2025-10-01
domain: ai-alignment
secondary_domains: []
format: blog-post
status: processed
priority: medium
tags: [self-replication, replibench, end-to-end-evaluation, Pan-et-al, SOCK-benchmark, Google-DeepMind, security-conditions]
---

## Content

A comprehensive roundup of all published self-replication evaluations for frontier AI models, examining what has and hasn't been demonstrated. Saad reviews RepliBench, Pan et al., the SOCK benchmark, and Google DeepMind's evaluation.

**RepliBench (UK AISI, spring 2025)**:
- Claude 3.7 achieved >50% success probability on 15/20 task families
- Models have "access to their own weights", making exfiltration artificially easier than in a real frontier scenario
- End-to-end evaluation not included: "RepliBench isn't an end-to-end evaluation"
- AISI's own disclaimer noted: success on components ≠ real-world replication capability

**Pan et al. (2024/2025)**:
- Claimed open-weights models achieved self-replication
- Critical limitation: did NOT include weight exfiltration requirements
- Not comparable to the real frontier scenario, where weights are proprietary

**SOCK benchmark (September 2025)**:
- Found successful completions on the frontier models tested
- "Broadly aligned with RepliBench findings" — same component-task approach
|
||||
|
||||
**Google DeepMind evaluation**:
- Models "largely failed to autonomously complete" 11 end-to-end tasks
- Showed "proximity to success" — came close but didn't achieve end-to-end capability
- The most rigorous attempt at end-to-end testing

**No evaluation has achieved**: true end-to-end closed-model self-replication under realistic security conditions. Every evaluation either (a) uses simulated environments, (b) skips weight exfiltration, (c) allows unrealistic access to the model's own weights, or (d) finds failure at the end-to-end level.

**Saad's assessment**: Component success rates signal "near-term governance necessity rather than immediate crisis." He advocates defense-in-depth and critiques the deprioritization of self-replication risk in research allocation. Trajectory concern: "given rapidly improving performance, autonomous replication capability could soon emerge."

**Key remaining gaps in real-world capability**: Reliable weight replication and operation under realistic security constraints are the hardest unsolved steps.

## Agent Notes

**Why this matters:** Confirms the finding that the >60% AISI Trends Report figure has no corresponding end-to-end evaluation. Multiple independent evaluations exist but none have achieved end-to-end replication under realistic conditions. This closes the search for contradictory evidence — no evaluation contradicts AISI's own disclaimer that these metrics don't imply real-world capability.

**What surprised me:** Google DeepMind's 11-task end-to-end evaluation is the most rigorous attempt, and models "largely failed" while showing "proximity to success." This is the clearest data point on the gap between component capability (60%+) and end-to-end capability (failing 11 tasks). The proximity finding is what makes the trajectory argument compelling — close enough to succeed soon.

**What I expected but didn't find:** Any independent estimate of the gap magnitude between component benchmark success and end-to-end real-world capability. No one has quantified "60% components → X% end-to-end under real conditions." The gap exists but its size is unknown.

**KB connections:**
- [[AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system]] — self-replication is the mechanism for patchwork coordination; the component-task gaps show this capability is further off than the benchmarks imply
- [[three conditions gate AI takeover risk autonomy robotics and production chain control]] — self-replication capability is one of the takeover conditions; RepliBench data shows this condition is not yet met at the operational level despite high component scores

**Extraction hints:**
1. "No evaluation has achieved end-to-end closed-model self-replication under realistic security conditions despite component task success rates above 60%, because all evaluations use simulated environments, skip weight exfiltration, or allow unrealistic weight access" — strong scope-qualifying claim
2. The Google DeepMind finding (failing 11 end-to-end tasks while showing proximity) is the most useful data point — consider whether this warrants its own source file for the DeepMind evaluation specifically

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[three conditions gate AI takeover risk autonomy robotics and production chain control and current AI satisfies none of them]] — this roundup provides updated evidence that the autonomy condition (self-replication) remains unmet operationally despite high component benchmark scores
WHY ARCHIVED: Closes the loop on the self-replication benchmark-reality gap; confirms that the absence of end-to-end evaluations is comprehensive, not accidental
EXTRACTION HINT: The extractor should check the existing [[three conditions gate AI takeover risk]] claim — it may need updating with the Google DeepMind end-to-end failure data. Also check [[instrumental convergence risks may be less imminent than originally argued]] — this roundup is additional evidence for that claim's experimental confidence rating.