pipeline: archive 1 source(s) post-merge
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
parent
55930169c6
commit
99751d55f9
1 changed file with 59 additions and 0 deletions
---
type: source
title: "Nature Medicine 2026: LLM Clinical Knowledge Does Not Translate to User Interactions — RCT With 1,298 Participants"
author: "Oxford Internet Institute & Nuffield Dept of Primary Care (University of Oxford, MLCommons et al.)"
url: https://www.nature.com/articles/s41591-025-04074-y
date: 2026-02-10
domain: health
secondary_domains: [ai-alignment]
format: research-paper
status: processed
priority: high
tags: [clinical-ai-safety, llm-medical-advice, real-world-deployment, benchmark-performance-gap, automation-bias, public-health-ai, belief-5, oxford]
flagged_for_theseus: ["Real-world deployment gap between LLM benchmark performance and user interaction outcomes — AI safety/alignment implication beyond healthcare"]
---
## Content
Published in *Nature Medicine*, February 2026 (vol. 32, pp. 609–615). Lead institution: Oxford Internet Institute and Nuffield Department of Primary Care Health Sciences, University of Oxford. Randomized, preregistered study with 1,298 participants.
**Study design:** Participants were randomly assigned to use an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control) to navigate 10 medical scenarios. Measured: correct condition identification and appropriate disposition (e.g., seek emergency care vs. wait-and-see).
**Key findings:**
- **LLMs tested alone:** Correctly identified conditions in **94.9%** of cases; correct disposition in **56.3%** on average (state-of-the-art benchmark performance).
- **Participants using LLMs:** Identified relevant conditions in **less than 34.5%** of cases; correct disposition in **less than 44.2%** — **NO BETTER THAN THE CONTROL GROUP** using traditional methods (online search, own judgment).
- The gap: 94.9% → 34.5% condition accuracy (a 60-percentage-point collapse) in real user interaction.
- Root cause: **"Two-way communication breakdown"** — users didn't know what information the LLMs needed; LLM responses frequently mixed good and poor recommendations, making it difficult to identify correct action.
- Study conclusion: "Current evaluation methods do not reflect the complexity of interacting with human users."
- Key call: "Just as clinical trials are required for medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities."
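
The size of the gap in the findings above can be checked with trivial arithmetic on the study's headline figures (a minimal sketch; the variable names are illustrative):

```python
# Headline figures reported in the study (see Key findings above)
llm_alone_acc = 94.9       # % correct condition ID, LLM tested in isolation
user_assisted_acc = 34.5   # % upper bound for participants using the same LLMs

gap_pp = llm_alone_acc - user_assisted_acc  # difference in percentage points
print(f"Deployment gap: {gap_pp:.1f} percentage points")
```

Note the gap is ~60 percentage points, not 60% relative: the relative drop in condition-identification accuracy is closer to 64%.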
Press coverage: University of Oxford newsroom (Feb 10), The Register ("AI chatbots don't improve medical advice, study finds"), NIHR Oxford BRC.
**Important scope note:** This study evaluated PUBLIC use (general population navigating medical scenarios) — NOT physician use (like OpenEvidence). But the underlying mechanism (communication breakdown, mixed-quality response interpretation) is not specific to untrained users.
## Agent Notes
**Why this matters:** This is a NEW (fifth) clinical AI safety failure mode distinct from the four documented in Sessions 8-11: (1) omission-reinforcement, (2) demographic bias amplification, (3) automation bias robustness, (4) medical misinformation propagation. This fifth mode is the **real-world deployment gap** — LLMs perform well in isolation on benchmarks but this performance does not translate to improved user outcomes in actual interaction. The 60-percentage-point gap between LLM solo performance (94.9%) and user-assisted performance (<34.5%) is structurally important.
**What surprised me:** The control group performed comparably to the LLM-assisted group. This means LLMs added ZERO measurable benefit over existing information-seeking behavior for the general public in medical scenarios. This is not "LLMs made things worse" (no harm signal) — it's "LLMs failed to improve over what people already do." That's the null result that clinical AI proponents have never wanted to confront directly.
**What I expected but didn't find:** A nuanced finding that better-designed LLMs (GPT-4o vs. Llama 3) outperformed simpler ones in real-world use. The study used three different LLMs and the result held across all — it's the INTERACTION mode, not the model, that explains the gap.
**KB connections:**
- Fifth distinct clinical AI safety failure mode: "real-world deployment gap" (benchmark performance does not predict user-assisted outcome improvement)
- Directly relevant to the JMIR 2025 Knowledge-Practice Gap systematic review (39 benchmarks, only 5% using real patient care data) — this study is part of the ~5% that does
- Connects to OE's USMLE 100% benchmark performance cited in the knowledge base — if OE is tested alone it likely performs at benchmark; but physician interactions with OE may suffer from a similar deployment gap
- Compounds with automation bias finding (NCT06963957): physicians defer to AI even when it's wrong; public users fail to extract correct guidance even when AI knows the right answer. Two different failure modes, both erasing clinical value.
**Extraction hints:**
- Primary claim: "LLMs achieve 94.9% condition identification accuracy in isolation but participants using the same LLMs perform no better than control groups (<34.5%), establishing a real-world deployment gap between LLM knowledge and user-assisted outcome improvement"
- The deployment gap is a SCOPE issue: OE is physician-facing (not public-facing), so the mechanism may be weaker for OE — but the zero-improvement-over-control result for informed users is still a serious evidentiary challenge to clinical AI value claims
- Flag this for Theseus: the benchmark-to-deployment gap is a general AI safety concern, not just healthcare-specific
**Context:** Oxford Internet Institute is a leading AI-society research center. MLCommons co-sponsorship adds credibility (they also run the MLPerf benchmarks). Published in Nature Medicine — highest-tier clinical AI venue. Preregistered RCT — highest evidence level.
## Curator Notes
PRIMARY CONNECTION: Belief 5 "clinical AI augments but creates novel safety risks requiring centaur design" — fifth failure mode documented
WHY ARCHIVED: Establishes the real-world deployment gap as distinct from automation bias; challenges the assumption that high benchmark performance predicts improved clinical outcomes
EXTRACTION HINT: Extract as standalone claim — distinguish from automation bias (different mechanism: there, physician defers to wrong AI; here, user fails to extract correct guidance from right AI)