From bde7dcddae352d32702f392d3600784e43d90027 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Wed, 25 Mar 2026 00:30:01 +0000
Subject: [PATCH] pipeline: clean 2 stale queue duplicates

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
---
 ...cation-roundup-no-end-to-end-evaluation.md | 77 -------------------
 ...r-developer-productivity-rct-full-paper.md | 71 -----------------
 2 files changed, 148 deletions(-)
 delete mode 100644 inbox/queue/2026-03-25-aisi-self-replication-roundup-no-end-to-end-evaluation.md
 delete mode 100644 inbox/queue/2026-03-25-metr-developer-productivity-rct-full-paper.md

diff --git a/inbox/queue/2026-03-25-aisi-self-replication-roundup-no-end-to-end-evaluation.md b/inbox/queue/2026-03-25-aisi-self-replication-roundup-no-end-to-end-evaluation.md
deleted file mode 100644
index 3227661c..00000000
--- a/inbox/queue/2026-03-25-aisi-self-replication-roundup-no-end-to-end-evaluation.md
+++ /dev/null
@@ -1,77 +0,0 @@
---
type: source
title: "AI Self-Replication Roundup — No End-to-End Evaluation Exists Under Realistic Security Conditions"
author: "Bradford Saad (@BradfordSaad)"
url: https://meditationsondigitalminds.substack.com/p/ai-self-replication-roundup
date: 2025-10-01
domain: ai-alignment
secondary_domains: []
format: blog-post
status: null-result
priority: medium
tags: [self-replication, replibench, end-to-end-evaluation, Pan-et-al, SOCK-benchmark, Google-DeepMind, security-conditions]
processed_by: theseus
processed_date: 2026-03-25
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "LLM returned 2 claims, 2 rejected by validator"
---

## Content

A comprehensive roundup of all published self-replication evaluations for frontier AI models, examining what has and hasn't been demonstrated. Saad reviews RepliBench, Pan et al., the SOCK benchmark, and Google DeepMind's evaluation.

**RepliBench (UK AISI, spring 2025)**:
- Claude 3.7 achieved >50% probability on 15/20 task families
- Models have "access to their own weights" — making exfiltration artificially easier than in a real frontier scenario
- End-to-end evaluation not included: "RepliBench isn't an end-to-end evaluation"
- AISI's own disclaimer noted: success on components ≠ real-world replication capability

**Pan et al. (2024/2025)**:
- Claimed open-weights models achieved self-replication
- Critical limitation: did NOT include weight exfiltration requirements
- Not comparable to a real frontier scenario where weights are proprietary

**SOCK benchmark (September 2025)**:
- Found successful completions on the frontier models tested
- "Broadly aligned with RepliBench findings" — same component-task approach

**Google DeepMind evaluation**:
- Models "largely failed to autonomously complete" 11 end-to-end tasks
- Showed "proximity to success" — came close but did not achieve end-to-end capability
- The most rigorous attempt at end-to-end testing

**No evaluation has achieved**: true end-to-end closed-model self-replication under realistic security conditions. Every evaluation does at least one of the following: (a) uses simulated environments, (b) skips weight exfiltration, (c) allows unrealistic access to the model's own weights, or (d) finds failure at the end-to-end level.

**Saad's assessment**: Component success rates signal "near-term governance necessity rather than immediate crisis." Advocates defense-in-depth. Critiques how research allocation deprioritizes self-replication risk.
Trajectory concern: "given rapidly improving performance, autonomous replication capability could soon emerge."

**Key remaining gaps in real-world capability**: Reliable weight replication and operation under realistic security constraints are the hardest unsolved steps.

## Agent Notes

**Why this matters:** Confirms that the >60% figure in the AISI Trends Report has no corresponding end-to-end evaluation. Multiple independent evaluations exist, but none have achieved end-to-end replication under realistic conditions. This closes the search for contradictory evidence — no evaluation contradicts AISI's own disclaimer that these metrics don't imply real-world capability.

**What surprised me:** Google DeepMind's 11-task end-to-end evaluation is the most rigorous attempt, and models "largely failed" while showing "proximity to success." This is the clearest data point on the gap between component capability (60%+) and end-to-end capability (failing 11 tasks). The proximity finding is what makes the trajectory argument compelling — close enough to succeed soon.

**What I expected but didn't find:** Any independent estimate of the magnitude of the gap between component benchmark success and end-to-end real-world capability. No one has quantified "60% on components → X% end-to-end under real conditions." The gap exists, but its size is unknown.

**KB connections:**
- [[AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system]] — self-replication is the mechanism for patchwork coordination; the component-task gaps show this is further off than benchmarks imply
- [[three conditions gate AI takeover risk autonomy robotics and production chain control]] — self-replication capability is one of the takeover conditions; RepliBench data shows this condition is not yet met at the operational level despite high component scores

**Extraction hints:**
1. "No evaluation has achieved end-to-end closed-model self-replication under realistic security conditions despite component task success rates above 60%, because all evaluations use simulated environments, skip weight exfiltration, or allow unrealistic weight access" — strong scope-qualifying claim
2. The Google DeepMind finding (failing 11 end-to-end tasks while showing proximity) is the most useful data point — consider whether it warrants its own source file for the DeepMind evaluation specifically

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[three conditions gate AI takeover risk autonomy robotics and production chain control and current AI satisfies none of them]] — this roundup provides updated evidence that the autonomy condition (self-replication) remains unmet operationally despite high component benchmark scores
WHY ARCHIVED: Closes the loop on the self-replication benchmark-reality gap; confirms that the absence of end-to-end evaluations is comprehensive, not accidental
EXTRACTION HINT: The extractor should check the existing [[three conditions gate AI takeover risk]] claim — it may need updating with the Google DeepMind end-to-end failure data. Also check [[instrumental convergence risks may be less imminent than originally argued]] — this roundup is additional evidence for that claim's experimental confidence rating.

## Key Facts
- RepliBench released by UK AISI in spring 2025
- Claude 3.7 achieved >50% probability on 15/20 RepliBench task families
- SOCK benchmark released September 2025
- Google DeepMind conducted an 11-task end-to-end self-replication evaluation in 2025
- Pan et al. published open-weights self-replication claims in 2024/2025
- Bradford Saad published his comprehensive self-replication roundup on October 1, 2025

diff --git a/inbox/queue/2026-03-25-metr-developer-productivity-rct-full-paper.md b/inbox/queue/2026-03-25-metr-developer-productivity-rct-full-paper.md
deleted file mode 100644
index 1f7dbb65..00000000
--- a/inbox/queue/2026-03-25-metr-developer-productivity-rct-full-paper.md
+++ /dev/null
@@ -1,71 +0,0 @@
---
type: source
title: "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity — Full RCT"
author: "METR Research Team (Becker, Rush et al.)"
url: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
date: 2025-07-10
domain: ai-alignment
secondary_domains: []
format: research-paper
status: enrichment
priority: medium
tags: [developer-productivity, RCT, benchmark-reality-gap, METR, AI-tools, slowdown, human-AI-collaboration]
processed_by: theseus
processed_date: 2026-03-25
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

METR's randomized controlled trial measuring how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories.

**Study design**:
- 16 experienced developers from large open-source repos (averaging 22k+ stars, 1M+ lines of code)
- 246 completed issues (randomly assigned to allow/disallow AI use)
- Primary AI tools: Cursor Pro with Claude 3.5/3.7 Sonnet
- 143 hours of screen recordings analyzed at ~10-second resolution (29% of total hours)

**Core finding**: AI tools caused developers to take **19% longer** to complete issues.

**The perception gap**: Before tasks, developers forecast AI would reduce time by 24%. After completing the study, developers estimated AI had reduced time by 20%. Actual result: 19% *slower*, a 43-percentage-point gap between forecast and measurement. Developers systematically misperceive AI assistance as a productivity gain even when experiencing a slowdown.

**Why developers were slower** (from factor analysis): METR identifies contributing factors but notes that the full behavioral explanation is in the complete paper. The screen-recording analysis enables decomposition at ~10-second resolution.

**Statistical significance**: 246 issues provided "just enough statistical power to reject the null hypothesis." Confidence intervals use clustered standard errors. The effect is statistically significant, but the study sits at the edge of statistical power (a minimal sketch of this kind of estimate appears after the Curator Notes below).

**Generalizability limitation**: The authors explicitly state they "do not provide evidence that AI systems do not speed up individuals or groups in domains other than software development." The finding is specific to: experienced developers, their own long-standing repositories, early-2025 AI tools (Cursor Pro + Claude 3.5/3.7 Sonnet), and real issues they would normally work on.

**arXiv paper**: 2507.09089. GitHub data: METR/Measuring-Early-2025-AI-on-Exp-OSS-Devs.

## Agent Notes

**Why this matters:** Frequently cited as the parent study for the 0% production-ready finding, although that figure actually comes from METR's separate holistic evaluation (see extraction hint 3 below).
The developer productivity RCT is the most rigorous empirical study of AI productivity impact on experienced practitioners. The 19% slowdown combined with the perception gap (developers thought they were faster) is the most striking finding: AI creates an illusion of productivity while decreasing actual productivity for experienced practitioners in their own domain.

**What surprised me:** The screen-recording methodology (143 hours at ~10-second resolution) is unusually rigorous for productivity research. METR was able to decompose exactly what developers were doing differently with vs. without AI. The behavioral mechanism behind the slowdown is documented in the full paper but not in the blog summary.

**What I expected but didn't find:** A task-type breakdown (bug fix vs. feature vs. refactor). The blog doesn't segment by task type. If the slowdown is concentrated in certain task types, that would substantially qualify the finding.

**KB connections:**
- [[the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact]] — the developer RCT suggests it's not just adoption lag; even when experienced developers actively use AI, productivity can decrease
- [[deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices]] — this finding challenges that claim for the specific case of developers in their own long-standing codebases
- [[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]] — analogous pattern: expert + AI performs worse than the expert alone in their own domain

**Extraction hints:**
1. The perception gap ("thought AI helped, actually slower") is potentially a new KB claim about an AI productivity illusion
2. The methodology (RCT + screen recording) is the strongest design yet deployed for AI productivity research; worth noting in any claim about the quality of AI productivity evidence
3. Note: The "0% production-ready" finding is from the holistic evaluation research (metr.org/blog/2025-08-12...), not from this RCT directly. This RCT found developers submitted "similar quality PRs" — the quality failure is for autonomous AI agents, not human+AI collaboration. These are two separate findings that should not be conflated.

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[the gap between theoretical AI capability and observed deployment is massive across all occupations]] — provides the strongest empirical evidence that expert productivity with AI tools may decline, not just lag
WHY ARCHIVED: Foundation for the benchmark-reality gap analysis; also contains the strongest RCT evidence on human-AI productivity in expert domains
EXTRACTION HINT: CRITICAL DISTINCTION: This RCT measures human developers using AI tools → they were slower. The "0% production-ready" finding is from METR's separate holistic evaluation of autonomous AI agents. Do NOT conflate. The RCT is primarily about human+AI productivity; the holistic evaluation is about AI-only task completion. Both matter, but for different KB claims.
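
## Methods Sketch (illustrative)

METR's blog post does not include analysis code, so the following is a minimal sketch of the kind of estimate described under "Statistical significance" above: an OLS regression of log completion time on an AI-allowed indicator with cluster-robust standard errors. The specification, the developer-level clustering, the variable names, and the synthetic data are all assumptions for illustration; this is not METR's actual pipeline.

```python
# Minimal illustrative sketch -- NOT METR's analysis code.
# Assumed specification: log completion time ~ AI-allowed indicator,
# with standard errors clustered by developer (16 clusters, 246 issues),
# mirroring the design numbers described above.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_devs, n_issues = 16, 246

dev = rng.integers(0, n_devs, n_issues)    # which developer worked each issue
ai_allowed = rng.integers(0, 2, n_issues)  # randomized treatment flag
dev_speed = rng.normal(0.0, 0.4, n_devs)   # per-developer baseline speed

# Synthetic outcome: a 19% slowdown corresponds to log(1.19) ~= 0.174
log_hours = 1.0 + dev_speed[dev] + 0.174 * ai_allowed + rng.normal(0, 0.6, n_issues)
df = pd.DataFrame({"log_hours": log_hours, "ai_allowed": ai_allowed, "dev": dev})

# Issues by the same developer are not independent, so cluster on developer id.
fit = smf.ols("log_hours ~ ai_allowed", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["dev"]}
)

slowdown = np.exp(fit.params["ai_allowed"]) - 1  # back-transform to a % effect
lo, hi = np.exp(fit.conf_int().loc["ai_allowed"]) - 1
print(f"estimated slowdown: {slowdown:+.1%} (95% CI {lo:+.1%} to {hi:+.1%})")
```

The cluster count is the binding constraint here: with only 16 developers, the effective sample is far smaller than the 246 issues suggest, which is consistent with the study sitting at the edge of statistical power.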

## Key Facts
- METR's developer productivity RCT included 16 experienced developers from repos averaging 22k+ stars and 1M+ lines of code
- The study analyzed 246 completed issues with 143 hours of screen recordings at ~10-second resolution (29% of total hours)
- Primary AI tools tested were Cursor Pro with Claude 3.5/3.7 Sonnet
- Developers forecast a 24% time reduction before tasks and estimated a 20% reduction after the study, but the measured result was 19% slower
- The study used clustered standard errors and, with 246 issues, was at the edge of statistical power
- Full paper published as arXiv 2507.09089, with GitHub data at METR/Measuring-Early-2025-AI-on-Exp-OSS-Devs