pipeline: archive 1 source(s) post-merge

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
Teleo Agents 2026-03-24 04:45:58 +00:00
parent 55930169c6
commit 99751d55f9


@@ -0,0 +1,59 @@
---
type: source
title: "Nature Medicine 2026: LLM Clinical Knowledge Does Not Translate to User Interactions — RCT With 1,298 Participants"
author: "Oxford Internet Institute & Nuffield Dept of Primary Care (University of Oxford, MLCommons et al.)"
url: https://www.nature.com/articles/s41591-025-04074-y
date: 2026-02-10
domain: health
secondary_domains: [ai-alignment]
format: research-paper
status: processed
priority: high
tags: [clinical-ai-safety, llm-medical-advice, real-world-deployment, benchmark-performance-gap, automation-bias, public-health-ai, belief-5, oxford]
flagged_for_theseus: ["Real-world deployment gap between LLM benchmark performance and user interaction outcomes — AI safety/alignment implication beyond healthcare"]
---
## Content
Published in *Nature Medicine*, February 2026 (Vol. 32, pp. 609–615). Lead institutions: Oxford Internet Institute and Nuffield Department of Primary Care Health Sciences, University of Oxford. Randomized, preregistered study with 1,298 participants.
**Study design:** Participants were randomly assigned to use one of three LLMs (GPT-4o, Llama 3, Command R+) or any information source of their choice (control) to navigate 10 medical scenarios. Outcomes measured: correct condition identification and appropriate disposition (e.g., seek emergency care vs. wait-and-see).
**Key findings:**
- **LLMs tested alone:** Correctly identified conditions in **94.9%** of cases; correct disposition in **56.3%** on average (state-of-the-art benchmark performance).
- **Participants using LLMs:** Identified relevant conditions in **fewer than 34.5%** of cases and chose a correct disposition in **fewer than 44.2%** of cases, **no better than the control group** using traditional methods (online search, own judgment).
- The gap: 94.9% → 34.5% condition accuracy (a 60-percentage-point collapse) in real user interaction; see the arithmetic sketch after this list.
- Root cause: **"Two-way communication breakdown"** — users didn't know what information the LLMs needed; LLM responses frequently mixed good and poor recommendations, making it difficult to identify the correct action.
- Study conclusion: "Current evaluation methods do not reflect the complexity of interacting with human users."
- Key call: "Just as clinical trials are required for medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities."
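To make the reported figures concrete, here is a minimal Python sketch: it reproduces the percentage-point gap from the numbers above and shows how a "no better than control" comparison could be checked with a two-proportion z-test. The per-arm counts (`n_llm`, `n_control`) and the control-arm rate (`p_control`) are assumed values for illustration only; the paper's actual arm sizes and statistical analysis are not reproduced here.

```python
# Illustrative arithmetic only. The accuracy figures are those reported in the
# paper; the per-arm counts and control-arm rate below are ASSUMED placeholders.
from math import sqrt

llm_alone_condition = 0.949       # LLMs tested in isolation (condition identification)
user_assisted_condition = 0.345   # upper bound for participants using LLMs
gap_pp = (llm_alone_condition - user_assisted_condition) * 100
print(f"Deployment gap: {gap_pp:.1f} percentage points")  # ~60.4

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for H0: p1 == p2, using a pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

n_llm, n_control = 650, 648       # hypothetical split of the 1,298 participants
p_control = 0.340                 # hypothetical control-arm identification rate
z = two_proportion_z(user_assisted_condition, n_llm, p_control, n_control)
print(f"z = {z:.2f}")  # a small |z| is what "no better than control" looks like
```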
Press coverage: University of Oxford newsroom (Feb 10), The Register ("AI chatbots don't improve medical advice, study finds"), NIHR Oxford BRC.
**Important scope note:** This study evaluated PUBLIC use (general population navigating medical scenarios) — NOT physician use (like OpenEvidence). But the underlying mechanism (communication breakdown, mixed-quality response interpretation) is not specific to untrained users.
## Agent Notes
**Why this matters:** This is a NEW (fifth) clinical AI safety failure mode distinct from the four documented in Sessions 8-11: (1) omission-reinforcement, (2) demographic bias amplification, (3) automation bias robustness, (4) medical misinformation propagation. This fifth mode is the **real-world deployment gap** — LLMs perform well in isolation on benchmarks but this performance does not translate to improved user outcomes in actual interaction. The 60-percentage-point gap between LLM solo performance (94.9%) and user-assisted performance (<34.5%) is structurally important.
**What surprised me:** The control group performed comparably to the LLM-assisted group. This means LLMs added ZERO measurable benefit over existing information-seeking behavior for the general public in medical scenarios. This is not "LLMs made things worse" (no harm signal) — it's "LLMs failed to improve over what people already do." That's the null result that clinical AI proponents have never wanted to confront directly.
**What I expected but didn't find:** A nuanced finding that better-designed LLMs (GPT-4o vs. Llama 3) outperformed simpler ones in real-world use. The study used three different LLMs and the result held across all — it's the INTERACTION mode, not the model, that explains the gap.
**KB connections:**
- Fifth distinct clinical AI safety failure mode: "real-world deployment gap" (benchmark performance does not predict user-assisted outcome improvement)
- Directly relevant to the JMIR 2025 systematic review finding that only 5% of LLM evaluations used real patient care data — this study is part of the ~5% that does
- Connects to OE's USMLE 100% benchmark performance cited in the knowledge base — if OE is tested alone, it likely performs at benchmark level; but physician interactions with OE may suffer from a similar deployment gap
- Compounds with automation bias finding (NCT06963957): physicians defer to AI even when it's wrong; public users fail to extract correct guidance even when AI knows the right answer. Two different failure modes, both erasing clinical value.
- Connects to the Knowledge-Practice Gap systematic review (JMIR 2025 — 39 benchmarks, only 5% real patient data)
**Extraction hints:**
- Primary claim: "LLMs achieve 94.9% condition identification accuracy in isolation but participants using the same LLMs perform no better than control groups (<34.5%), establishing a real-world deployment gap between LLM knowledge and user-assisted outcome improvement" (an illustrative structured-record sketch follows this list)
- The deployment gap is a SCOPE issue: OE is physician-facing (not public-facing), so the mechanism may be weaker for OE — but the zero-improvement-over-control result for informed users is still a serious evidentiary challenge to clinical AI value claims
- Flag this for Theseus: the benchmark-to-deployment gap is a general AI safety concern, not just healthcare-specific
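A purely illustrative sketch of how the primary claim might be captured as a standalone structured record; the field names are hypothetical and are not the extraction pipeline's actual schema.

```python
# Hypothetical claim record, for illustration only; these field names are not
# the extraction pipeline's actual schema.
primary_claim = {
    "claim": (
        "LLMs achieve 94.9% condition-identification accuracy in isolation, but "
        "participants using the same LLMs perform no better than control (<34.5%), "
        "establishing a real-world deployment gap between LLM knowledge and "
        "user-assisted outcome improvement."
    ),
    "source_url": "https://www.nature.com/articles/s41591-025-04074-y",
    "evidence_level": "preregistered RCT, n=1,298, Nature Medicine 2026",
    "scope_caveat": "public-facing use, not physician-facing tools such as OpenEvidence",
    "flag_for_theseus": True,  # benchmark-to-deployment gap generalizes beyond healthcare
}
```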
**Context:** The Oxford Internet Institute is a leading AI-and-society research center. MLCommons co-sponsorship adds credibility (it also runs the MLPerf benchmark suites). Published in *Nature Medicine* — a highest-tier clinical AI venue. Preregistered RCT — the highest evidence level.
## Curator Notes
PRIMARY CONNECTION: Belief 5 "clinical AI augments but creates novel safety risks requiring centaur design" — fifth failure mode documented
WHY ARCHIVED: Establishes the real-world deployment gap as distinct from automation bias; challenges the assumption that high benchmark performance predicts improved clinical outcomes
EXTRACTION HINT: Extract as standalone claim — distinguish from automation bias (different mechanism: there, physician defers to wrong AI; here, user fails to extract correct guidance from right AI)