From 55930169c6f7a1a0e0ed152769b82cdda5089d1b Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Tue, 24 Mar 2026 04:34:06 +0000
Subject: [PATCH] extract: 2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
---
 ... errors when overriding correct outputs.md |  6 +++++
 ...iagnostic accuracy in randomized trials.md |  6 +++++
 ...edicine-llm-public-medical-advice-rct.json | 24 +++++++++++++++++++
 ...-medicine-llm-public-medical-advice-rct.md | 17 ++++++++++++-
 4 files changed, 52 insertions(+), 1 deletion(-)
 create mode 100644 inbox/queue/.extraction-debug/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.json

diff --git a/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md b/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md
index 2cadc630..e3664409 100644
--- a/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md
+++ b/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md
@@ -53,6 +53,12 @@ NCT07328815 tests whether a UI-layer behavioral nudge (ensemble-LLM confidence s
 
 RCT evidence (NCT06963957, medRxiv August 2025) shows automation bias persists even after 20 hours of AI-literacy training specifically designed to teach critical evaluation of AI output. Physicians with this training still voluntarily deferred to deliberately erroneous LLM recommendations in 3 of 6 clinical vignettes, demonstrating that the human-in-the-loop degradation mechanism operates even when humans are extensively trained to resist it.
 
+### Additional Evidence (extend)
+*Source: [[2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct]] | Added: 2026-03-24*
+
+The Oxford 2026 RCT documents a complementary failure mode: where automation bias causes physicians to defer to incorrect AI output, the deployment gap shows users failing to extract correct guidance from correct AI output. Both erase clinical value, but through opposite mechanisms: one from over-reliance, the other from under-extraction. The deployment gap produced zero improvement over control (not degradation), distinguishing it from automation bias, which actively worsens outcomes.
+
+
diff --git a/domains/health/medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md b/domains/health/medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md
index bb919b4c..9265e6e5 100644
--- a/domains/health/medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md
+++ b/domains/health/medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md
@@ -40,6 +40,12 @@ ARISE report identifies specific failure modes: real-world performance 'breaks d
 
 JMIR systematic review of 761 studies provides methodological foundation: 95% of clinical LLM evaluation uses medical exam questions rather than real patient data, with only 5% assessing performance on actual patient care. Traditional benchmarks show saturation at 84-90% USMLE accuracy, but conversational frameworks reveal 19.3pp accuracy drop (82% → 62.7%) when moving from case vignettes to multi-turn dialogues. Review concludes: 'substantial disconnects from clinical reality and foundational gaps in construct validity, data integrity, and safety coverage.' This establishes that the Oxford/Nature Medicine RCT deployment gap (94.9% → 34.5%) is part of a systematic field-wide pattern, not an isolated finding.
 
+### Additional Evidence (extend)
+*Source: [[2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct]] | Added: 2026-03-24*
+
+The Oxford Nature Medicine 2026 RCT (n=1,298) extends the benchmark-to-clinical-impact gap to public users: LLMs achieved 94.9% condition identification in isolation, yet users assisted by LLMs performed no better than control groups (<34.5%). The roughly 60-point deployment gap held across GPT-4o, Llama 3, and Command R+, indicating that the interaction mode, not the model, explains the failure. The root cause was identified as a 'two-way communication breakdown' in which users could not extract correct guidance even when the AI possessed the right answer.
+
+
diff --git a/inbox/queue/.extraction-debug/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.json b/inbox/queue/.extraction-debug/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.json
new file mode 100644
index 00000000..ea0b4d9c
--- /dev/null
+++ b/inbox/queue/.extraction-debug/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.json
@@ -0,0 +1,24 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "llm-medical-advice-shows-60-point-deployment-gap-between-benchmark-performance-and-user-assisted-outcomes.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 1,
+    "kept": 0,
+    "fixed": 1,
+    "rejected": 1,
+    "fixes_applied": [
+      "llm-medical-advice-shows-60-point-deployment-gap-between-benchmark-performance-and-user-assisted-outcomes.md:set_created:2026-03-24"
+    ],
+    "rejections": [
+      "llm-medical-advice-shows-60-point-deployment-gap-between-benchmark-performance-and-user-assisted-outcomes.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-24"
+}
\ No newline at end of file
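For context on the .extraction-debug record above: the run stamped a missing creation date onto the extracted claim (fixed: 1) but still rejected it for lacking extractor attribution (rejected: 1, kept: 0). A minimal sketch of that fix-then-reject logic, assuming a hypothetical front-matter dict and field names; this is not the actual extraction pipeline's code:

```python
from datetime import date

def validate_claim(front_matter: dict) -> dict:
    """Auto-fix recoverable defects, reject fatal ones.

    Mirrors the .extraction-debug record above: a missing 'created'
    date is fixable, a missing extractor attribution is not.
    Field names here are assumptions, not the real pipeline schema.
    """
    fm = dict(front_matter)  # work on a copy
    fixes, rejections = [], []

    # Fixable defect: stamp the processing date when 'created' is absent.
    if "created" not in fm:
        fm["created"] = date.today().isoformat()
        fixes.append(f"set_created:{fm['created']}")

    # Fatal defect: a claim with no extractor attribution cannot be kept.
    if "extractor" not in fm:
        rejections.append("missing_attribution_extractor")

    return {
        "kept": 0 if rejections else 1,
        "fixed": 1 if fixes else 0,
        "rejected": 1 if rejections else 0,
        "fixes_applied": fixes,
        "rejections": rejections,
    }

# A claim missing both fields reproduces the stats above:
# kept: 0, fixed: 1, rejected: 1.
print(validate_claim({"claim": "llm deployment gap"}))
```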
diff --git a/inbox/queue/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.md b/inbox/queue/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.md
index be204d53..2d025010 100644
--- a/inbox/queue/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.md
+++ b/inbox/queue/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.md
@@ -7,10 +7,14 @@ date: 2026-02-10
 domain: health
 secondary_domains: [ai-alignment]
 format: research-paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [clinical-ai-safety, llm-medical-advice, real-world-deployment, benchmark-performance-gap, automation-bias, public-health-ai, belief-5, oxford]
 flagged_for_theseus: ["Real-world deployment gap between LLM benchmark performance and user interaction outcomes — AI safety/alignment implication beyond healthcare"]
+processed_by: vida
+processed_date: 2026-03-24
+enrichments_applied: ["medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md", "human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -57,3 +61,14 @@ Press coverage: University of Oxford newsroom (Feb 10), The Register ("AI chatbo
 PRIMARY CONNECTION: Belief 5 "clinical AI augments but creates novel safety risks requiring centaur design" — fifth failure mode documented
 WHY ARCHIVED: Establishes the real-world deployment gap as distinct from automation bias; challenges the assumption that high benchmark performance predicts improved clinical outcomes
 EXTRACTION HINT: Extract as standalone claim — distinguish from automation bias (different mechanism: there, physician defers to wrong AI; here, user fails to extract correct guidance from right AI)
+
+
+## Key Facts
+- Oxford Internet Institute and Nuffield Department of Primary Care published RCT in Nature Medicine, February 2026, Vol. 32, pp. 609–615
+- Study enrolled 1,298 participants across 10 medical scenarios
+- LLMs tested: GPT-4o, Llama 3, Command R+
+- LLM solo performance: 94.9% condition identification, 56.3% appropriate disposition
+- User-assisted performance: <34.5% condition identification, <44.2% appropriate disposition
+- Control group (traditional methods) performed comparably to LLM-assisted group
+- Study was preregistered and randomized
+- MLCommons co-sponsored the research
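Sanity check on the '60-point deployment gap' referenced in the enrichment notes, computed from the Key Facts figures alone. Since 34.5% and 44.2% are upper bounds on user-assisted performance, the computed gaps are lower bounds:

```python
# Figures from the Key Facts list above, in percentage points.
solo = {"condition identification": 94.9, "appropriate disposition": 56.3}
assisted_upper_bound = {"condition identification": 34.5, "appropriate disposition": 44.2}

for metric, solo_score in solo.items():
    gap = solo_score - assisted_upper_bound[metric]
    print(f"{metric}: gap >= {gap:.1f} pp")

# condition identification: gap >= 60.4 pp  (the cited "60-point" gap)
# appropriate disposition:  gap >= 12.1 pp
```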