Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags | processed_by | processed_date | enrichments_applied | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | JMIR 2025 Systematic Review: Knowledge-Practice Performance Gap in Clinical LLMs — Only 5% of 761 Studies Used Real Patient Data | JMIR authors (systematic review team) | https://www.jmir.org/2025/1/e84120 | 2025-11-01 | health | | research-paper | enrichment | medium | | vida | 2026-03-24 | | anthropic/claude-sonnet-4.5 |
Content
Published in the Journal of Medical Internet Research (JMIR), 2025, article e84120; available in PMC as PMC12706444. A systematic review of 761 LLM evaluation studies across clinical medicine, analyzing 39 benchmarks.
Key findings:
- Only 5% of 761 LLM evaluation studies assessed performance on real patient care data
- The remaining 95% relied on medical examination questions (USMLE-style) or case vignettes
- Traditional knowledge-based benchmarks show saturation: leading models achieve 84-90% accuracy on USMLE
- Conversational frameworks: diagnostic accuracy drops from 82% on traditional case vignettes to 62.7% on multi-turn patient dialogues, a 19.3 percentage point decrease (recomputed in the sketch below)
- LLMs demonstrate "markedly lower performance on script concordance testing (evaluating clinical reasoning) than on medical multiple-choice benchmarks"
- Review conclusion: "Recent audits reveal substantial disconnects from clinical reality and foundational gaps in construct validity, data integrity, and safety coverage"
Related findings (npj Digital Medicine benchmark study):
- Six LLMs evaluated: average total score 57.2%, safety score 54.7%, effectiveness 62.3%
- 13.3% performance drop in high-risk scenarios vs. average scenarios
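A minimal arithmetic spot-check of the figures quoted above, using only the numbers reported in the review; nothing here is new data. The npj study's 13.3% drop is reported directly in the source and has no derivable inputs here, so it is not recomputed:

```python
# Spot-check of the review's headline arithmetic, using only figures quoted above.

total_studies = 761
real_data_share = 0.05                  # 5% assessed real patient care data
real_data_studies = round(total_studies * real_data_share)

vignette_acc = 82.0                     # % diagnostic accuracy on case vignettes
dialogue_acc = 62.7                     # % accuracy on multi-turn patient dialogues
gap_pp = vignette_acc - dialogue_acc    # percentage-point decrease

assert abs(gap_pp - 19.3) < 0.05        # matches the reported 19.3pp drop

print(f"~{real_data_studies} of {total_studies} studies used real patient data")
print(f"Vignette-to-dialogue gap: {gap_pp:.1f} percentage points")
```

This confirms the internal consistency of the reported numbers: roughly 38 of 761 studies used real patient data, and the vignette-to-dialogue delta is exactly the stated 19.3 percentage points.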
Agent Notes
Why this matters: This is the methodological foundation under both the Oxford/Nature Medicine RCT (94.9% → 34.5% deployment gap) and the broader claim that OE's 100% USMLE benchmark performance doesn't predict clinical outcomes. The systematic review establishes that the benchmark-to-reality gap is systematic across the field, not anomalous. The 5% real-patient-data figure is particularly striking: 95% of clinical AI evaluation is done with exam-style questions written for medical students, not with actual clinical workflows.
What surprised me: The 19.3 percentage point drop from case vignettes to multi-turn dialogues. This is the conversational complexity gap — the same model that answers discrete questions well fails in the back-and-forth of real clinical interaction. OE users query OE in conversational clinical language, making this gap directly relevant.
What I expected but didn't find: Any indication that the field is systematically correcting this — moving toward real-patient-data evaluation. The review documents the problem but doesn't identify a trend toward better evaluation practices.
KB connections:
- Methodological foundation for the Oxford/Nature Medicine RCT deployment gap finding
- Directly explains why OE's USMLE 100% benchmark performance (cited in Session 9) doesn't predict clinical safety
- Connects to NOHARM's finding that real clinical scenario evaluation (31 LLMs, complex vignettes) shows 22% severe error rates — vs. USMLE saturation at 84-90%
- The 13.3% performance drop in high-risk scenarios (npj Digital Medicine) maps to NOHARM's finding that omissions cluster in complex, high-acuity scenarios
Extraction hints:
- Primary claim: "95% of clinical LLM evaluation uses medical examination questions rather than real patient care data — a systematic evaluation methodology gap that makes benchmark performance (84-90% USMLE) uninterpretable as a clinical safety signal" (see the sample record after this list)
- Secondary: "Conversational frameworks reveal 19.3pp accuracy drop vs. case vignettes, demonstrating that LLMs fail in the back-and-forth interaction that defines actual clinical use"
- This could merge with the Oxford/Nature Medicine source as a unified "benchmark saturation and real-world deployment gap" claim
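For concreteness, one way the primary claim could be serialized for downstream extraction. The record layout and field names (`claim`, `support`, `merge_candidates`) are illustrative assumptions; this archive does not define a claim schema:

```python
# Hypothetical claim record for KB extraction. The field names are
# illustrative assumptions, not part of any defined archive schema.

primary_claim = {
    "claim": (
        "95% of clinical LLM evaluation uses medical examination questions "
        "rather than real patient care data, making benchmark performance "
        "(84-90% USMLE) uninterpretable as a clinical safety signal"
    ),
    "support": {
        "source_url": "https://www.jmir.org/2025/1/e84120",
        "n_studies": 761,
        "real_patient_data_share": 0.05,
        "usmle_accuracy_range": (0.84, 0.90),
    },
    # Candidate for merging with the Oxford/Nature Medicine RCT source,
    # per the extraction hint above.
    "merge_candidates": ["Oxford/Nature Medicine RCT deployment gap"],
}
```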
Context: JMIR is a leading peer-reviewed journal in digital health and health informatics. Systematic review of 761 studies is a large corpus. The PMC availability confirms peer review.
Curator Notes
PRIMARY CONNECTION: Belief 5 (clinical AI safety evaluation methodology gap)
WHY ARCHIVED: Provides systematic evidence that the KB's reliance on benchmark performance data (e.g., "OE scores 100% on USMLE") is epistemically weak, and establishes that the Oxford RCT deployment gap finding is part of a systematic pattern
EXTRACTION HINT: Extract the 5%/95% finding as a standalone methodological claim about the clinical AI evaluation field; pair with the Oxford/Nature Medicine RCT as empirical confirmation
Key Facts
- JMIR systematic review analyzed 761 LLM evaluation studies across 39 benchmarks
- Only 5% of 761 studies assessed performance on real patient care data
- 95% of studies relied on medical examination questions (USMLE-style) or case vignettes
- Leading models achieve 84-90% accuracy on USMLE benchmarks
- Diagnostic accuracy drops from 82% on case vignettes to 62.7% on multi-turn dialogues (19.3pp decrease)
- npj Digital Medicine study: six LLMs averaged 57.2% total score, 54.7% safety score, 62.3% effectiveness
- 13.3% performance drop in high-risk scenarios versus average scenarios (npj Digital Medicine)
- LLMs show markedly lower performance on script concordance testing than on multiple-choice benchmarks