pipeline: archive 1 source(s) post-merge
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
parent
55930169c6
commit
99751d55f9
1 changed file with 59 additions and 0 deletions
---
type: source
title: "Nature Medicine 2026: LLM Clinical Knowledge Does Not Translate to User Interactions — RCT With 1,298 Participants"
author: "Oxford Internet Institute & Nuffield Dept of Primary Care (University of Oxford, MLCommons et al.)"
url: https://www.nature.com/articles/s41591-025-04074-y
date: 2026-02-10
domain: health
secondary_domains: [ai-alignment]
format: research-paper
status: processed
priority: high
tags: [clinical-ai-safety, llm-medical-advice, real-world-deployment, benchmark-performance-gap, automation-bias, public-health-ai, belief-5, oxford]
flagged_for_theseus: ["Real-world deployment gap between LLM benchmark performance and user interaction outcomes — AI safety/alignment implication beyond healthcare"]
---
## Content
Published in *Nature Medicine*, February 2026 (vol. 32, pp. 609–615). Lead institution: Oxford Internet Institute and Nuffield Department of Primary Care Health Sciences, University of Oxford. Randomized, preregistered study with 1,298 participants.
**Study design:** Participants were randomly assigned to use an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control) to navigate 10 medical scenarios. Measured: correct condition identification and appropriate disposition (e.g., seek emergency care vs. wait-and-see).
**Key findings:**
- **LLMs tested alone:** Correctly identified conditions in **94.9%** of cases; correct disposition in **56.3%** on average (state-of-the-art benchmark performance).
- **Participants using LLMs:** Identified relevant conditions in **less than 34.5%** of cases; correct disposition in **less than 44.2%** — **NO BETTER THAN THE CONTROL GROUP** using traditional methods (online search, own judgment).
- The gap: 94.9% → 34.5% condition accuracy (a 60-percentage-point collapse) in real user interaction.
- Root cause: **"Two-way communication breakdown"** — users didn't know what information the LLMs needed; LLM responses frequently mixed good and poor recommendations, making it difficult to identify correct action.
- Study conclusion: "Current evaluation methods do not reflect the complexity of interacting with human users."
- Key call: "Just as clinical trials are required for medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities."
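
The size of the gap in the findings above can be checked with trivial arithmetic on the study's headline figures (a minimal sketch; the variable names are illustrative):

```python
# Headline figures reported in the study (see Key findings above)
llm_alone_acc = 94.9       # % correct condition ID, LLM tested in isolation
user_assisted_acc = 34.5   # % upper bound for participants using the same LLMs

gap_pp = llm_alone_acc - user_assisted_acc  # difference in percentage points
print(f"Deployment gap: {gap_pp:.1f} percentage points")
```

Note the gap is ~60 percentage points, not 60% relative: the relative drop in condition-identification accuracy is closer to 64%.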
Press coverage: University of Oxford newsroom (Feb 10), The Register ("AI chatbots don't improve medical advice, study finds"), NIHR Oxford BRC.
**Important scope note:** This study evaluated PUBLIC use (general population navigating medical scenarios) — NOT physician use (like OpenEvidence). But the underlying mechanism (communication breakdown, mixed-quality response interpretation) is not specific to untrained users.
## Agent Notes
**Why this matters:** This is a NEW (fifth) clinical AI safety failure mode distinct from the four documented in Sessions 8-11: (1) omission-reinforcement, (2) demographic bias amplification, (3) automation bias robustness, (4) medical misinformation propagation. This fifth mode is the **real-world deployment gap** — LLMs perform well in isolation on benchmarks but this performance does not translate to improved user outcomes in actual interaction. The 60-percentage-point gap between LLM solo performance (94.9%) and user-assisted performance (<34.5%) is structurally important.
**What surprised me:** The control group performed comparably to the LLM-assisted group. This means LLMs added ZERO measurable benefit over existing information-seeking behavior for the general public in medical scenarios. This is not "LLMs made things worse" (no harm signal) — it's "LLMs failed to improve over what people already do." That's the null result that clinical AI proponents have never wanted to confront directly.
**What I expected but didn't find:** A nuanced finding that better-designed LLMs (GPT-4o vs. Llama 3) outperformed simpler ones in real-world use. The study used three different LLMs and the result held across all — it's the INTERACTION mode, not the model, that explains the gap.
**KB connections:**
- Fifth distinct clinical AI safety failure mode: "real-world deployment gap" (benchmark performance does not predict user-assisted outcome improvement)
- Directly relevant to the JMIR 2025 Knowledge-Practice Gap systematic review (39 benchmarks, only 5% using real patient care data) — this study is part of the ~5% that does
- Connects to OE's USMLE 100% benchmark performance cited in the knowledge base — if OE is tested alone it likely performs at benchmark; but physician interactions with OE may suffer from a similar deployment gap
- Compounds with automation bias finding (NCT06963957): physicians defer to AI even when it's wrong; public users fail to extract correct guidance even when AI knows the right answer. Two different failure modes, both erasing clinical value.
**Extraction hints:**
- Primary claim: "LLMs achieve 94.9% condition identification accuracy in isolation but participants using the same LLMs perform no better than control groups (<34.5%), establishing a real-world deployment gap between LLM knowledge and user-assisted outcome improvement"
- The deployment gap is a SCOPE issue: OE is physician-facing (not public-facing), so the mechanism may be weaker for OE — but the zero-improvement-over-control result for informed users is still a serious evidentiary challenge to clinical AI value claims
- Flag this for Theseus: the benchmark-to-deployment gap is a general AI safety concern, not just healthcare-specific
**Context:** Oxford Internet Institute is a leading AI-society research center. MLCommons co-sponsorship adds credibility (they also run the MLPerf benchmarks). Published in Nature Medicine — highest-tier clinical AI venue. Preregistered RCT — highest evidence level.
## Curator Notes
PRIMARY CONNECTION: Belief 5 "clinical AI augments but creates novel safety risks requiring centaur design" — fifth failure mode documented
WHY ARCHIVED: Establishes the real-world deployment gap as distinct from automation bias; challenges the assumption that high benchmark performance predicts improved clinical outcomes
EXTRACTION HINT: Extract as standalone claim — distinguish from automation bias (different mechanism: there, physician defers to wrong AI; here, user fails to extract correct guidance from right AI)