diff --git a/inbox/archive/health/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.md b/inbox/archive/health/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.md
new file mode 100644
index 00000000..74c349d0
--- /dev/null
+++ b/inbox/archive/health/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.md
@@ -0,0 +1,59 @@
+---
+type: source
+title: "Nature Medicine 2026: LLM Clinical Knowledge Does Not Translate to User Interactions — RCT With 1,298 Participants"
+author: "Oxford Internet Institute & Nuffield Dept of Primary Care (University of Oxford, MLCommons et al.)"
+url: https://www.nature.com/articles/s41591-025-04074-y
+date: 2026-02-10
+domain: health
+secondary_domains: [ai-alignment]
+format: research-paper
+status: processed
+priority: high
+tags: [clinical-ai-safety, llm-medical-advice, real-world-deployment, benchmark-performance-gap, automation-bias, public-health-ai, belief-5, oxford]
+flagged_for_theseus: ["Real-world deployment gap between LLM benchmark performance and user interaction outcomes — AI safety/alignment implication beyond healthcare"]
+---
+
+## Content
+
+Published in *Nature Medicine*, February 2026 (Vol. 32, pp. 609–615). Lead institution: Oxford Internet Institute and Nuffield Department of Primary Care Health Sciences, University of Oxford. Randomized, preregistered study with 1,298 participants.
+
+**Study design:** Participants were randomly assigned to use an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control) to navigate 10 medical scenarios. Measured outcomes: correct condition identification and appropriate disposition (e.g., seek emergency care vs. wait-and-see).
+
+**Key findings:**
+- **LLMs tested alone:** Correctly identified conditions in **94.9%** of cases; correct disposition in **56.3%** on average (state-of-the-art benchmark performance).
+- **Participants using LLMs:** Identified relevant conditions in **less than 34.5%** of cases; correct disposition in **less than 44.2%** — **NO BETTER THAN THE CONTROL GROUP**, which used traditional methods (online search, own judgment).
+- The gap: 94.9% → 34.5% condition accuracy (a collapse of at least 60 percentage points) in real user interaction.
+- Root cause: **"Two-way communication breakdown"** — users didn't know what information the LLMs needed, and LLM responses frequently mixed good and poor recommendations, making it difficult to identify the correct action.
+- Study conclusion: "Current evaluation methods do not reflect the complexity of interacting with human users."
+- Key call: "Just as clinical trials are required for medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities."
+
+Press coverage: University of Oxford newsroom (Feb 10), The Register ("AI chatbots don't improve medical advice, study finds"), NIHR Oxford BRC.
+
+**Important scope note:** This study evaluated PUBLIC use (general population navigating medical scenarios) — NOT physician use (like OpenEvidence). But the underlying mechanism (communication breakdown, mixed-quality response interpretation) is not specific to untrained users.
+
+## Agent Notes
+
+**Why this matters:** This is a NEW (fifth) clinical AI safety failure mode, distinct from the four documented in Sessions 8-11: (1) omission-reinforcement, (2) demographic bias amplification, (3) automation bias robustness, (4) medical misinformation propagation. The fifth mode is the **real-world deployment gap** — LLMs perform well in isolation on benchmarks, but that performance does not translate to improved user outcomes in actual interaction. The 60-percentage-point gap between LLM solo performance (94.9%) and user-assisted performance (<34.5%) is structurally important.
+
+**What surprised me:** The control group performed comparably to the LLM-assisted group. This means LLMs added ZERO measurable benefit over existing information-seeking behavior for the general public in medical scenarios. This is not "LLMs made things worse" (no harm signal) — it's "LLMs failed to improve over what people already do." That's the null result that clinical AI proponents have never wanted to confront directly.
+
+**What I expected but didn't find:** A nuanced finding that better-designed LLMs (GPT-4o vs. Llama 3) outperformed simpler ones in real-world use. The study used three different LLMs and the result held across all three — it's the INTERACTION mode, not the model, that explains the gap.
+
+**KB connections:**
+- Fifth distinct clinical AI safety failure mode: "real-world deployment gap" (benchmark performance does not predict user-assisted outcome improvement)
+- Directly relevant to the Knowledge-Practice Gap systematic review (JMIR 2025; across 39 benchmarks, only 5% of LLM evaluations used real patient care data) — this study is part of the ~5% that does
+- Connects to OE's USMLE 100% benchmark performance cited in the knowledge base — if OE is tested alone it likely performs at benchmark, but physician interactions with OE may suffer from a similar deployment gap
+- Compounds with the automation bias finding (NCT06963957): physicians defer to AI even when it's wrong; public users fail to extract correct guidance even when the AI knows the right answer. Two different failure modes, both erasing clinical value.
+
+**Extraction hints:**
+- Primary claim: "LLMs achieve 94.9% condition identification accuracy in isolation, but participants using the same LLMs perform no better than control groups (<34.5%), establishing a real-world deployment gap between LLM knowledge and user-assisted outcome improvement"
+- The deployment gap is a SCOPE issue: OE is physician-facing (not public-facing), so the mechanism may be weaker for OE — but the zero-improvement-over-control result for informed users is still a serious evidentiary challenge to clinical AI value claims
+- Flag this for Theseus: the benchmark-to-deployment gap is a general AI safety concern, not a healthcare-specific one
+
+**Context:** Oxford Internet Institute is a leading AI-and-society research center. MLCommons co-sponsorship adds credibility (they also run the MLPerf benchmarks). Published in Nature Medicine — the highest-tier clinical AI venue. Preregistered RCT — highest evidence level.
+
+## Curator Notes
+PRIMARY CONNECTION: Belief 5 "clinical AI augments but creates novel safety risks requiring centaur design" — fifth failure mode documented
+WHY ARCHIVED: Establishes the real-world deployment gap as distinct from automation bias; challenges the assumption that high benchmark performance predicts improved clinical outcomes
+EXTRACTION HINT: Extract as a standalone claim — distinguish from automation bias (different mechanism: there, the physician defers to a wrong AI; here, the user fails to extract correct guidance from a right AI)
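
For reuse elsewhere in the KB, the headline arithmetic in the Content section can be kept as a tiny sanity-check snippet. This is a sketch: the figures are as reported in this note (not computed from the paper's data), and the constant and function names are my own. Because the participant figures are upper bounds ("less than"), the computed gaps are lower bounds on the true collapse.

```python
# Headline figures from the Nature Medicine RCT, as recorded in this note.
LLM_ALONE_CONDITION = 94.9        # % of cases, LLM tested in isolation
LLM_ALONE_DISPOSITION = 56.3      # % of cases, LLM tested in isolation
USER_CONDITION_UPPER = 34.5       # % of cases, participants using LLMs (upper bound)
USER_DISPOSITION_UPPER = 44.2     # % of cases, participants using LLMs (upper bound)

def deployment_gap(alone: float, assisted_upper: float) -> float:
    """Lower bound on the benchmark-to-deployment gap, in percentage points."""
    return round(alone - assisted_upper, 1)

print(deployment_gap(LLM_ALONE_CONDITION, USER_CONDITION_UPPER))      # 60.4
print(deployment_gap(LLM_ALONE_DISPOSITION, USER_DISPOSITION_UPPER))  # 12.1
```

The condition-identification gap (60.4 pp) is the "60-percentage-point collapse" cited above; the disposition gap is smaller (12.1 pp) mainly because the LLMs' solo disposition accuracy was already modest.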