diff --git a/domains/ai-alignment/capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md b/domains/ai-alignment/capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md
index e91c2226..3a9a3c95 100644
--- a/domains/ai-alignment/capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md
+++ b/domains/ai-alignment/capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md
@@ -17,6 +17,12 @@ This leaves motivation selection as the only durable approach: either direct spe
 
 ---
 
+### Additional Evidence (confirm)
+*Source: [[2026-03-21-replibench-autonomous-replication-capabilities]] | Added: 2026-03-23*
+
+Current models already demonstrate >50% success on the hardest variants of tasks designed to test circumvention of security controls (KYC, persistent deployment evasion). The capability trajectory shows rapid improvement in exactly the domains where containment depends on security measures designed by humans.
+
+
 Relevant Notes:
 - [[safe AI development requires building alignment mechanisms before scaling capability]] -- Bostrom's analysis shows why motivation selection must precede capability scaling
 - [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- continuous weaving is a form of motivation selection that avoids the limitations of both direct specification and one-shot loading
diff --git a/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md b/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md
index fdc955f5..4e62f756 100644
--- a/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md
+++ b/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md
@@ -63,6 +63,12 @@ The research-to-compliance translation gap fails for the same structural reason
 
 The coordination gap provides the mechanism explaining why voluntary commitments fail even beyond racing dynamics: coordination infrastructure investments have diffuse benefits but concentrated costs, creating a public goods problem. Labs won't build shared response infrastructure unilaterally because competitors free-ride on the benefits while the builder bears full costs. This is distinct from the competitive pressure argument — it's about why shared infrastructure doesn't get built even when racing isn't the primary concern.
 
+### Additional Evidence (confirm)
+*Source: [[2026-03-21-replibench-autonomous-replication-capabilities]] | Added: 2026-03-23*
+
+RepliBench exists as a comprehensive self-replication evaluation tool but is not integrated into compliance frameworks despite EU AI Act Article 55 taking effect after its publication. Labs can voluntarily use it but face no enforcement mechanism requiring them to do so, creating competitive pressure to avoid evaluations that might reveal concerning capabilities.
+
+
 Relevant Notes:
diff --git a/inbox/queue/.extraction-debug/2026-03-21-replibench-autonomous-replication-capabilities.json b/inbox/queue/.extraction-debug/2026-03-21-replibench-autonomous-replication-capabilities.json
new file mode 100644
index 00000000..471d918c
--- /dev/null
+++ b/inbox/queue/.extraction-debug/2026-03-21-replibench-autonomous-replication-capabilities.json
@@ -0,0 +1,34 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "frontier-ai-models-demonstrate-component-capabilities-for-autonomous-replication-with-claude-37-achieving-50-percent-success-on-hardest-self-replication-tasks.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "self-replication-capability-evaluations-exist-as-research-tools-but-remain-absent-from-compliance-frameworks-creating-a-gap-between-measured-risk-and-regulatory-enforcement.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 4,
+    "rejected": 2,
+    "fixes_applied": [
+      "frontier-ai-models-demonstrate-component-capabilities-for-autonomous-replication-with-claude-37-achieving-50-percent-success-on-hardest-self-replication-tasks.md:set_created:2026-03-23",
+      "frontier-ai-models-demonstrate-component-capabilities-for-autonomous-replication-with-claude-37-achieving-50-percent-success-on-hardest-self-replication-tasks.md:stripped_wiki_link:three conditions gate AI takeover risk autonomy robotics and",
+      "frontier-ai-models-demonstrate-component-capabilities-for-autonomous-replication-with-claude-37-achieving-50-percent-success-on-hardest-self-replication-tasks.md:stripped_wiki_link:scalable oversight degrades rapidly as capability gaps grow",
+      "self-replication-capability-evaluations-exist-as-research-tools-but-remain-absent-from-compliance-frameworks-creating-a-gap-between-measured-risk-and-regulatory-enforcement.md:set_created:2026-03-23"
+    ],
+    "rejections": [
+      "frontier-ai-models-demonstrate-component-capabilities-for-autonomous-replication-with-claude-37-achieving-50-percent-success-on-hardest-self-replication-tasks.md:missing_attribution_extractor",
+      "self-replication-capability-evaluations-exist-as-research-tools-but-remain-absent-from-compliance-frameworks-creating-a-gap-between-measured-risk-and-regulatory-enforcement.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-23"
+}
\ No newline at end of file
diff --git a/inbox/queue/2026-03-21-replibench-autonomous-replication-capabilities.md b/inbox/queue/2026-03-21-replibench-autonomous-replication-capabilities.md
index ab49793d..8a746a40 100644
--- a/inbox/queue/2026-03-21-replibench-autonomous-replication-capabilities.md
+++ b/inbox/queue/2026-03-21-replibench-autonomous-replication-capabilities.md
@@ -7,9 +7,13 @@ date: 2025-04-21
 domain: ai-alignment
 secondary_domains: []
 format: paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [self-replication, autonomous-replication, capability-evaluation, AISI, RepliBench, loss-of-control, EU-AI-Act, benchmark]
+processed_by: theseus
+processed_date: 2026-03-23
+enrichments_applied: ["voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md", "capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -47,3 +51,13 @@ Key finding: Current models "do not currently pose a credible threat of self-rep
 PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure]] + [[three conditions gate AI takeover risk]]
 WHY ARCHIVED: Directly addresses the Bench-2-CoP zero-coverage finding; provides quantitative capability trajectory data for self-replication
 EXTRACTION HINT: Focus on (1) the quantitative capability finding (>50% success on hardest variants), (2) the "could soon emerge" trajectory assessment, and (3) the gap between research evaluation existence and compliance integration
+
+
+## Key Facts
+- RepliBench consists of 20 task families and 86 individual tasks
+- Five frontier models were tested in the RepliBench evaluation
+- Claude 3.7 Sonnet achieved >50% pass@10 on 15/20 task families
+- Claude 3.7 Sonnet achieved >50% success on 9/20 of the hardest task variants
+- RepliBench was published in April 2025
+- EU AI Act Article 55 took effect in August 2025
+- Bench-2-CoP (arXiv:2508.05464) found zero coverage of self-replication in widely-used compliance benchmarks
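Note on the pass@10 figures in the Key Facts above: the queue note cites the metric without defining it. As a point of reference, below is a minimal sketch of the standard pass@k estimator from the code-evaluation literature (n attempts per task, c successes); whether RepliBench computes pass@10 exactly this way is an assumption, not something stated in the source.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single task: the probability
    that a random subset of k attempts, drawn from n total attempts of
    which c succeeded, contains at least one success."""
    if n - c < k:
        return 1.0  # too few failures: every size-k subset contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustration with hypothetical numbers: 10 attempts, 3 successes
# gives pass@10 = 1.0 and pass@1 = 0.3 on the same attempts.
print(pass_at_k(10, 3, 10), pass_at_k(10, 3, 1))
```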