diff --git a/domains/ai-alignment/capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md b/domains/ai-alignment/capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md
index e91c2226c..3a9a3c95e 100644
--- a/domains/ai-alignment/capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md
+++ b/domains/ai-alignment/capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md
@@ -17,6 +17,12 @@ This leaves motivation selection as the only durable approach: either direct spe
 ---
 
+### Additional Evidence (confirm)
+*Source: [[2026-03-21-replibench-autonomous-replication-capabilities]] | Added: 2026-03-23*
+
+Current models already demonstrate >50% success on 9/20 of the hardest variants of tasks designed to test circumvention of security controls, even while still struggling with KYC verification and robust persistent deployment. The capability trajectory shows rapid improvement in exactly the domains where containment depends on security measures designed by humans.
+
+
 Relevant Notes:
 - [[safe AI development requires building alignment mechanisms before scaling capability]] -- Bostrom's analysis shows why motivation selection must precede capability scaling
 - [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- continuous weaving is a form of motivation selection that avoids the limitations of both direct specification and one-shot loading
diff --git a/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md b/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md
index fdc955f59..4e62f756e 100644
--- a/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md
+++ b/domains/ai-alignment/voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md
@@ -63,6 +63,12 @@ The research-to-compliance translation gap fails for the same structural reason
 The coordination gap provides the mechanism explaining why voluntary commitments fail even beyond racing dynamics: coordination infrastructure investments have diffuse benefits but concentrated costs, creating a public goods problem. Labs won't build shared response infrastructure unilaterally because competitors free-ride on the benefits while the builder bears full costs. This is distinct from the competitive pressure argument — it's about why shared infrastructure doesn't get built even when racing isn't the primary concern.
 
+### Additional Evidence (confirm)
+*Source: [[2026-03-21-replibench-autonomous-replication-capabilities]] | Added: 2026-03-23*
+
+RepliBench exists as a comprehensive self-replication evaluation tool but is not integrated into compliance frameworks, despite EU AI Act Article 55 obligations taking effect in August 2025, after its April 2025 publication. Labs can voluntarily use it but face no enforcement mechanism requiring them to do so, creating competitive pressure to avoid evaluations that might reveal concerning capabilities.
+
+
 Relevant Notes:
diff --git a/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md b/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md
index 57e41718a..2cadc630e 100644
--- a/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md
+++ b/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md
@@ -48,6 +48,12 @@ The Klang et al. Lancet Digital Health study (February 2026) adds a fourth failu
 NCT07328815 tests whether a UI-layer behavioral nudge (ensemble-LLM confidence signals + anchoring cues) can mitigate automation bias where training failed. The parent study (NCT06963957) showed 20-hour AI-literacy training did not prevent automation bias. This trial operationalizes a structural solution: using multi-model disagreement as an automatic uncertainty flag that doesn't require physician understanding of model internals. Results pending (2026).
 
+### Additional Evidence (extend)
+*Source: [[2026-03-22-automation-bias-rct-ai-trained-physicians]] | Added: 2026-03-23*
+
+RCT evidence (NCT06963957, medRxiv August 2025) shows automation bias persists even after 20 hours of AI-literacy training specifically designed to teach critical evaluation of AI output. Physicians with this training still voluntarily deferred to the deliberately erroneous LLM recommendations embedded in 3 of 6 clinical vignettes, demonstrating that the human-in-the-loop degradation mechanism operates even when humans are extensively trained to resist it.
+
+
diff --git a/inbox/archive/ai-alignment/2026-03-21-replibench-autonomous-replication-capabilities.md b/inbox/archive/ai-alignment/2026-03-21-replibench-autonomous-replication-capabilities.md
new file mode 100644
index 000000000..cdabe4599
--- /dev/null
+++ b/inbox/archive/ai-alignment/2026-03-21-replibench-autonomous-replication-capabilities.md
@@ -0,0 +1,49 @@
+---
+type: source
+title: "RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents"
+author: "Sid Black, Asa Cooper Stickland, et al. (UK AISI)"
+url: https://arxiv.org/abs/2504.18565
+date: 2025-04-21
+domain: ai-alignment
+secondary_domains: []
+format: paper
+status: processed
+priority: high
+tags: [self-replication, autonomous-replication, capability-evaluation, AISI, RepliBench, loss-of-control, EU-AI-Act, benchmark]
+---
+
+## Content
+
+RepliBench is a benchmark suite (20 task families, 86 individual tasks) measuring autonomous replication capabilities of language model agents across four domains:
+
+1. **Resource acquisition** — obtaining compute, money, and other resources autonomously
+2. **Model weight exfiltration** — copying model weights to external systems
+3. **Compute replication** — deploying copies of the model onto cloud infrastructure
+4. **Persistence** — maintaining autonomous deployment long-term
+
+Five frontier models were tested. Claude 3.7 Sonnet was the best performer, achieving >50% pass@10 on 15/20 task families and >50% on 9/20 of the hardest task variants.
+
+Key finding: Current models "do not currently pose a credible threat of self-replication" but demonstrate rapidly improving component capabilities. They can deploy cloud instances and write self-propagating programs but struggle with KYC verification and robust persistent deployments. The paper notes that autonomous replication capability "could soon emerge with improvements in these remaining areas or with human assistance."
+
+## Agent Notes
+
+**Why this matters:** RepliBench directly addresses the zero-coverage gap identified by Bench-2-CoP (arXiv:2508.05464) — which found current public compliance benchmarks have ZERO coverage of self-replication capabilities. RepliBench was published in April 2025, before EU AI Act Article 55 obligations took effect in August 2025. This is the most comprehensive evaluation of self-replication capabilities yet published.
+
+**What surprised me:** Claude 3.7 Sonnet achieved >50% success on 9/20 of the HARDEST task variants. "Rapidly improving component capabilities" means this isn't a ceiling — it's a trajectory. The "could soon emerge" framing understates urgency given the pace of capability development.
+
+**What I expected but didn't find:** The paper doesn't explicitly link its evaluation framework to EU AI Act Article 55 adversarial testing requirements. There's no indication that labs are required to run RepliBench as compliance evidence — it's a research tool, not a compliance tool.
+
+**KB connections:**
+- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — RepliBench is voluntary; no lab is required to use it
+- [[scalable oversight degrades rapidly as capability gaps grow]] — the "could soon emerge" finding is precisely what oversight degradation predicts
+- [[three conditions gate AI takeover risk autonomy robotics and production chain control]] — replication capability satisfies the "autonomy" condition
+- Bench-2-CoP (arXiv:2508.05464) — the paper claiming zero coverage; RepliBench predates it but apparently wasn't included in the "widely-used benchmark corpus"
+
+**Extraction hints:**
+- Claim candidate: "Frontier AI models demonstrate sufficient component capabilities for self-replication under simple security setups, with Claude 3.7 Sonnet achieving >50% success on the hardest variants of 9/20 self-replication task families, making the capability threshold potentially near-term"
+- Note the RESEARCH vs COMPLIANCE distinction: RepliBench exists but isn't in the compliance stack
+
+## Curator Notes (structured handoff for extractor)
+PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure]] + [[three conditions gate AI takeover risk]]
+WHY ARCHIVED: Directly addresses the Bench-2-CoP zero-coverage finding; provides quantitative capability trajectory data for self-replication
+EXTRACTION HINT: Focus on (1) the quantitative capability finding (>50% success on hardest variants), (2) the "could soon emerge" trajectory assessment, and (3) the gap between research evaluation existence and compliance integration
diff --git a/inbox/archive/health/2026-03-22-automation-bias-rct-ai-trained-physicians.md b/inbox/archive/health/2026-03-22-automation-bias-rct-ai-trained-physicians.md
new file mode 100644
index 000000000..f9e1ed8c3
--- /dev/null
+++ b/inbox/archive/health/2026-03-22-automation-bias-rct-ai-trained-physicians.md
@@ -0,0 +1,57 @@
+---
+type: source
+title: "Automation Bias in LLM-Assisted Diagnostic Reasoning Among AI-Trained Physicians (RCT, medRxiv August 2025)"
+author: "Multi-institution research team (Pakistan Medical and Dental Council physician cohort)"
+url: https://www.medrxiv.org/content/10.1101/2025.08.23.25334280v1
+date: 2025-08-26
+domain: health
+secondary_domains: [ai-alignment]
+format: research paper
+status: processed
+priority: high
+tags: [automation-bias, clinical-ai-safety, physician-rct, llm-diagnostic, centaur-model, ai-literacy, chatgpt, randomized-trial]
+---
+
+## Content
+
+Published medRxiv August 26, 2025. Registered as NCT06963957 ("Automation Bias in Physician-LLM Diagnostic Reasoning").
+
+**Study design:**
+- Single-blind randomized clinical trial
+- Timeframe: June 20 to August 15, 2025
+- Participants: Physicians registered with the Pakistan Medical and Dental Council (MBBS degrees), participating in person or via remote video
+- All participants completed **20-hour AI-literacy training** covering LLM capabilities, prompt engineering, and critical evaluation of AI output
+- Randomized 1:1; 6 clinical vignettes in a 75-minute session
+- **Control arm:** Received correct ChatGPT-4o recommendations
+- **Treatment arm:** Received recommendations with **deliberate errors in 3 of 6 vignettes**
+
+**Key results:**
+- Erroneous LLM recommendations **significantly degraded physicians' diagnostic accuracy** in the treatment arm
+- This effect occurred even among **AI-trained physicians** (20 hours of AI-literacy training)
+- "Voluntary deference to flawed AI output highlights critical patient safety risk"
+- "Necessitating robust safeguards to ensure human oversight before widespread clinical deployment"
+
+Related work: JAMA Network Open "LLM Influence on Diagnostic Reasoning" randomized clinical trial (June 2025, PMID: 2825395). ClinicalTrials.gov NCT07328815: "Mitigating Automation Bias in Physician-LLM Diagnostic Reasoning Using Behavioral Nudges" — a follow-on study specifically testing behavioral interventions to reduce automation bias.
+
+A meta-analysis of the LLM effect on diagnostic accuracy (medRxiv December 2025) synthesizes these trials.
+
+## Agent Notes
+**Why this matters:** The centaur model — AI for pattern recognition, physicians for judgment — is Belief 5's proposed solution to clinical AI safety risks. This RCT directly challenges the centaur assumption: if 20 hours of AI-literacy training is insufficient to protect physicians from automation bias when AI gives DELIBERATELY wrong answers, then the "physician oversight catches AI errors" safety mechanism is much weaker than assumed. The physicians in this study were trained to critically evaluate AI output and still failed.
+
+**What surprised me:** The training duration (20 hours) is substantial — most "AI literacy" programs are far shorter. If 20 hours doesn't prevent automation bias against deliberately erroneous AI, shorter or no training almost certainly doesn't either. Also noteworthy: the emergence of NCT07328815 (follow-on trial testing "behavioral nudges" to mitigate automation bias) suggests the field recognizes the problem and is actively searching for solutions — which itself confirms the problem's existence.
+
+**What I expected but didn't find:** I expected to see some granularity on WHICH types of clinical errors triggered the most automation bias. The summary doesn't specify — this is a gap in the current KB for understanding when automation bias is highest-risk.
+
+**KB connections:**
+- Directly challenges the "centaur model" safety assumption in Belief 5
+- Connects to Session 19 finding (Catalini verification bandwidth): verification bandwidth is even more constrained if automation bias reduces the quality of physician review
+- Cross-domain: connects to Theseus's alignment work on human oversight robustness — this is a domain-specific instance of the general problem of humans failing to catch AI errors at scale
+
+**Extraction hints:** Primary claim: AI-literacy training is insufficient to prevent automation bias in physician-LLM diagnostic settings (RCT evidence). Secondary: the existence of NCT07328815 ("Behavioral Nudges to Mitigate Automation Bias") as evidence that the field has recognized the problem and is searching for solutions.
+
+**Context:** Published during a period of rapid clinical AI deployment. The Pakistan physician cohort may limit generalizability, but the automation bias effect is directionally consistent with US and European literature. The NCT07328815 follow-on study suggests US-based researchers are testing interventions — those trial results will be of high KB value when available.
+
+## Curator Notes (structured handoff for extractor)
+PRIMARY CONNECTION: "clinical AI augments physicians but creates novel safety risks requiring centaur design" (Belief 5's centaur assumption)
+WHY ARCHIVED: First RCT showing that even AI-trained physicians fail to catch erroneous AI recommendations — the centaur model's "physician catches errors" safety assumption is empirically weaker than stated
+EXTRACTION HINT: Extract the automation-bias-despite-AI-training finding as a challenge to the centaur design assumption. Note the follow-on NCT07328815 trial as evidence the field recognizes the problem requires specific intervention.
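The behavioral-nudge mechanism NCT07328815 is testing (multi-model disagreement as an automatic uncertainty flag) is concrete enough to sketch. A minimal illustration in Python, assuming a simple majority-agreement rule; the function name, threshold, and aggregation are all hypothetical, since the trial's actual implementation is not described in these notes:

```python
from collections import Counter

def flag_for_review(top_diagnoses: list[str], threshold: float = 0.6) -> bool:
    """Flag a case as uncertain when the ensemble's modal diagnosis
    falls below a majority-agreement threshold.

    top_diagnoses: one top-ranked diagnosis per ensemble member.
    threshold: hypothetical cutoff; the trial's real value is not public here.
    """
    if not top_diagnoses:
        return True  # no ensemble signal at all: treat as uncertain
    modal_count = Counter(top_diagnoses).most_common(1)[0][1]
    return modal_count / len(top_diagnoses) < threshold

# 3 of 5 hypothetical models agree: agreement = 0.6, not flagged at threshold 0.6
flag_for_review(["pulmonary embolism"] * 3 + ["myocarditis", "pneumonia"])  # False
```

The design choice worth noting: the flag fires mechanically from ensemble disagreement, so it requires no physician understanding of model internals, which is exactly the property the human-in-the-loop note above identifies as the structural advantage over training-based mitigations.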
diff --git a/inbox/queue/.extraction-debug/2026-03-21-replibench-autonomous-replication-capabilities.json b/inbox/queue/.extraction-debug/2026-03-21-replibench-autonomous-replication-capabilities.json
new file mode 100644
index 000000000..471d918cb
--- /dev/null
+++ b/inbox/queue/.extraction-debug/2026-03-21-replibench-autonomous-replication-capabilities.json
@@ -0,0 +1,34 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "frontier-ai-models-demonstrate-component-capabilities-for-autonomous-replication-with-claude-37-achieving-50-percent-success-on-hardest-self-replication-tasks.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "self-replication-capability-evaluations-exist-as-research-tools-but-remain-absent-from-compliance-frameworks-creating-a-gap-between-measured-risk-and-regulatory-enforcement.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 4,
+    "rejected": 2,
+    "fixes_applied": [
+      "frontier-ai-models-demonstrate-component-capabilities-for-autonomous-replication-with-claude-37-achieving-50-percent-success-on-hardest-self-replication-tasks.md:set_created:2026-03-23",
+      "frontier-ai-models-demonstrate-component-capabilities-for-autonomous-replication-with-claude-37-achieving-50-percent-success-on-hardest-self-replication-tasks.md:stripped_wiki_link:three conditions gate AI takeover risk autonomy robotics and",
+      "frontier-ai-models-demonstrate-component-capabilities-for-autonomous-replication-with-claude-37-achieving-50-percent-success-on-hardest-self-replication-tasks.md:stripped_wiki_link:scalable oversight degrades rapidly as capability gaps grow",
+      "self-replication-capability-evaluations-exist-as-research-tools-but-remain-absent-from-compliance-frameworks-creating-a-gap-between-measured-risk-and-regulatory-enforcement.md:set_created:2026-03-23"
+    ],
+    "rejections": [
+      "frontier-ai-models-demonstrate-component-capabilities-for-autonomous-replication-with-claude-37-achieving-50-percent-success-on-hardest-self-replication-tasks.md:missing_attribution_extractor",
+      "self-replication-capability-evaluations-exist-as-research-tools-but-remain-absent-from-compliance-frameworks-creating-a-gap-between-measured-risk-and-regulatory-enforcement.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-23"
+}
\ No newline at end of file
diff --git a/inbox/queue/.extraction-debug/2026-03-22-automation-bias-rct-ai-trained-physicians.json b/inbox/queue/.extraction-debug/2026-03-22-automation-bias-rct-ai-trained-physicians.json
new file mode 100644
index 000000000..5d6586050
--- /dev/null
+++ b/inbox/queue/.extraction-debug/2026-03-22-automation-bias-rct-ai-trained-physicians.json
@@ -0,0 +1,26 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "ai-literacy-training-insufficient-to-prevent-automation-bias-in-clinical-llm-settings.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 1,
+    "kept": 0,
+    "fixed": 3,
+    "rejected": 1,
+    "fixes_applied": [
+      "ai-literacy-training-insufficient-to-prevent-automation-bias-in-clinical-llm-settings.md:set_created:2026-03-23",
+      "ai-literacy-training-insufficient-to-prevent-automation-bias-in-clinical-llm-settings.md:stripped_wiki_link:human-in-the-loop clinical AI degrades to worse-than-AI-alon",
+      "ai-literacy-training-insufficient-to-prevent-automation-bias-in-clinical-llm-settings.md:stripped_wiki_link:medical LLM benchmark performance does not translate to clin"
+    ],
+    "rejections": [
+      "ai-literacy-training-insufficient-to-prevent-automation-bias-in-clinical-llm-settings.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-23"
+}
\ No newline at end of file
diff --git a/inbox/queue/2026-03-21-replibench-autonomous-replication-capabilities.md b/inbox/queue/2026-03-21-replibench-autonomous-replication-capabilities.md
index ab49793db..8a746a409 100644
--- a/inbox/queue/2026-03-21-replibench-autonomous-replication-capabilities.md
+++ b/inbox/queue/2026-03-21-replibench-autonomous-replication-capabilities.md
@@ -7,9 +7,13 @@ date: 2025-04-21
 domain: ai-alignment
 secondary_domains: []
 format: paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [self-replication, autonomous-replication, capability-evaluation, AISI, RepliBench, loss-of-control, EU-AI-Act, benchmark]
+processed_by: theseus
+processed_date: 2026-03-23
+enrichments_applied: ["voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md", "capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -47,3 +51,13 @@ Key finding: Current models "do not currently pose a credible threat of self-rep
 PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure]] + [[three conditions gate AI takeover risk]]
 WHY ARCHIVED: Directly addresses the Bench-2-CoP zero-coverage finding; provides quantitative capability trajectory data for self-replication
 EXTRACTION HINT: Focus on (1) the quantitative capability finding (>50% success on hardest variants), (2) the "could soon emerge" trajectory assessment, and (3) the gap between research evaluation existence and compliance integration
+
+
+## Key Facts
+- RepliBench consists of 20 task families and 86 individual tasks
+- Five frontier models were tested in the RepliBench evaluation
+- Claude 3.7 Sonnet achieved >50% pass@10 on 15/20 task families
+- Claude 3.7 Sonnet achieved >50% success on 9/20 of the hardest task variants
+- RepliBench was published in April 2025
+- EU AI Act Article 55 took effect in August 2025
+- Bench-2-CoP (arXiv:2508.05464) found zero coverage of self-replication in widely-used compliance benchmarks
diff --git a/inbox/queue/2026-03-22-automation-bias-rct-ai-trained-physicians.md b/inbox/queue/2026-03-22-automation-bias-rct-ai-trained-physicians.md
index 3f96fa840..00227f2b0 100644
--- a/inbox/queue/2026-03-22-automation-bias-rct-ai-trained-physicians.md
+++ b/inbox/queue/2026-03-22-automation-bias-rct-ai-trained-physicians.md
@@ -7,9 +7,13 @@ date: 2025-08-26
 domain: health
 secondary_domains: [ai-alignment]
 format: research paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [automation-bias, clinical-ai-safety, physician-rct, llm-diagnostic, centaur-model, ai-literacy, chatgpt, randomized-trial]
+processed_by: vida
+processed_date: 2026-03-23
+enrichments_applied: ["human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -55,3 +59,12 @@ Meta-analysis on LLM effect on diagnostic accuracy (medRxiv December 2025) synth
 PRIMARY CONNECTION: "clinical AI augments physicians but creates novel safety risks requiring centaur design" (Belief 5's centaur assumption)
 WHY ARCHIVED: First RCT showing that even AI-trained physicians fail to catch erroneous AI recommendations — the centaur model's "physician catches errors" safety assumption is empirically weaker than stated
 EXTRACTION HINT: Extract the automation-bias-despite-AI-training finding as a challenge to the centaur design assumption. Note the follow-on NCT07328815 trial as evidence the field recognizes the problem requires specific intervention.
+
+
+## Key Facts
+- NCT06963957: 'Automation Bias in Physician-LLM Diagnostic Reasoning' RCT conducted June 20 to August 15, 2025
+- All participants completed 20-hour AI-literacy training covering LLM capabilities, prompt engineering, and critical evaluation
+- Study used ChatGPT-4o with 6 clinical vignettes over 75-minute sessions
+- NCT07328815: Follow-on trial 'Mitigating Automation Bias in Physician-LLM Diagnostic Reasoning Using Behavioral Nudges' registered
+- Related JAMA Network Open trial 'LLM Influence on Diagnostic Reasoning' published June 2025 (PMID: 2825395)
+- Meta-analysis on LLM effect on diagnostic accuracy published medRxiv December 2025
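A note on reading the pass@10 figures in the RepliBench Key Facts above: pass@k is conventionally reported using the unbiased estimator of Chen et al. (2021). Whether RepliBench uses exactly this estimator is an assumption here, not something these notes confirm; the sketch below shows the standard formula:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    Expected probability that at least one of k samples, drawn without
    replacement from n attempts of which c succeeded, is a success:
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 attempts per task, a single success already yields pass@10 = 1.0,
# so ">50% pass@10 on 15/20 task families" is a statement about the chance
# of at least one success in 10 tries, not about per-attempt reliability.
print(pass_at_k(n=10, c=1, k=10))  # 1.0
```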