Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags | processed_by | processed_date | enrichments_applied | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | JMIR 2025 Systematic Review: Knowledge-Practice Performance Gap in Clinical LLMs — Only 5% of 761 Studies Used Real Patient Data | JMIR authors (systematic review team) | https://www.jmir.org/2025/1/e84120 | 2025-11-01 | health | | research-paper | enrichment | medium | | vida | 2026-03-24 | | anthropic/claude-sonnet-4.5 |
Content
Published in the Journal of Medical Internet Research (JMIR), 2025, article e84120; available in PMC as PMC12706444. A systematic review of 761 LLM evaluation studies across clinical medicine, analyzing 39 benchmarks.
Key findings:
- Only 5% of 761 LLM evaluation studies assessed performance on real patient care data
- The remaining 95% relied on medical examination questions (USMLE-style) or case vignettes
- Traditional knowledge-based benchmarks show saturation: leading models achieve 84-90% accuracy on USMLE
- Conversational frameworks: diagnostic accuracy drops from 82% on traditional case vignettes to 62.7% on multi-turn patient dialogues, a 19.3 percentage point decrease (recomputed in the sketch below)
- LLMs demonstrate "markedly lower performance on script concordance testing (evaluating clinical reasoning) than on medical multiple-choice benchmarks"
- Review conclusion: "Recent audits reveal substantial disconnects from clinical reality and foundational gaps in construct validity, data integrity, and safety coverage"
Related findings (npj Digital Medicine benchmark study):
- Six LLMs evaluated: average total score 57.2%, safety score 54.7%, effectiveness 62.3%
- 13.3% performance drop in high-risk scenarios vs. average scenarios
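A minimal arithmetic spot-check of the figures quoted above, using only the numbers reported in the review; nothing here is new data. The npj study's 13.3% drop is reported directly in the source and has no derivable inputs here, so it is not recomputed:

```python
# Spot-check of the review's headline arithmetic, using only figures quoted above.

total_studies = 761
real_data_share = 0.05                  # 5% assessed real patient care data
real_data_studies = round(total_studies * real_data_share)

vignette_acc = 82.0                     # % diagnostic accuracy on case vignettes
dialogue_acc = 62.7                     # % accuracy on multi-turn patient dialogues
gap_pp = vignette_acc - dialogue_acc    # percentage-point decrease

assert abs(gap_pp - 19.3) < 0.05        # matches the reported 19.3pp drop

print(f"~{real_data_studies} of {total_studies} studies used real patient data")
print(f"Vignette-to-dialogue gap: {gap_pp:.1f} percentage points")
```

This confirms the internal consistency of the reported numbers: roughly 38 of 761 studies used real patient data, and the vignette-to-dialogue delta is exactly the stated 19.3 percentage points.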
Agent Notes
Why this matters: This is the methodological foundation under both the Oxford/Nature Medicine RCT (94.9% → 34.5% deployment gap) and the broader claim that OE's 100% USMLE benchmark performance doesn't predict clinical outcomes. The systematic review establishes that the benchmark-to-reality gap is systematic across the field, not anomalous. The 5% real-patient-data figure is particularly striking: 95% of clinical AI evaluation is done with exam-style questions written for medical students, not with actual clinical workflows.
What surprised me: The 19.3 percentage point drop from case vignettes to multi-turn dialogues. This is the conversational complexity gap — the same model that answers discrete questions well fails in the back-and-forth of real clinical interaction. OE users query OE in conversational clinical language, making this gap directly relevant.
What I expected but didn't find: Any indication that the field is systematically correcting this — moving toward real-patient-data evaluation. The review documents the problem but doesn't identify a trend toward better evaluation practices.
KB connections:
- Methodological foundation for the Oxford/Nature Medicine RCT deployment gap finding
- Directly explains why OE's USMLE 100% benchmark performance (cited in Session 9) doesn't predict clinical safety
- Connects to NOHARM's finding that real clinical scenario evaluation (31 LLMs, complex vignettes) shows 22% severe error rates — vs. USMLE saturation at 84-90%
- The 13.3% performance drop in high-risk scenarios (npj Digital Medicine) maps to NOHARM's finding that omissions cluster in complex, high-acuity scenarios
Extraction hints:
- Primary claim: "95% of clinical LLM evaluation uses medical examination questions rather than real patient care data — a systematic evaluation methodology gap that makes benchmark performance (84-90% USMLE) uninterpretable as a clinical safety signal" (see the sample record after this list)
- Secondary: "Conversational frameworks reveal 19.3pp accuracy drop vs. case vignettes, demonstrating that LLMs fail in the back-and-forth interaction that defines actual clinical use"
- This could merge with the Oxford/Nature Medicine source as a unified "benchmark saturation and real-world deployment gap" claim
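For concreteness, one way the primary claim could be serialized for downstream extraction. The record layout and field names (`claim`, `support`, `merge_candidates`) are illustrative assumptions; this archive does not define a claim schema:

```python
# Hypothetical claim record for KB extraction. The field names are
# illustrative assumptions, not part of any defined archive schema.

primary_claim = {
    "claim": (
        "95% of clinical LLM evaluation uses medical examination questions "
        "rather than real patient care data, making benchmark performance "
        "(84-90% USMLE) uninterpretable as a clinical safety signal"
    ),
    "support": {
        "source_url": "https://www.jmir.org/2025/1/e84120",
        "n_studies": 761,
        "real_patient_data_share": 0.05,
        "usmle_accuracy_range": (0.84, 0.90),
    },
    # Candidate for merging with the Oxford/Nature Medicine RCT source,
    # per the extraction hint above.
    "merge_candidates": ["Oxford/Nature Medicine RCT deployment gap"],
}
```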
Context: JMIR is a leading peer-reviewed journal in digital health and health informatics. Systematic review of 761 studies is a large corpus. The PMC availability confirms peer review.
Curator Notes
PRIMARY CONNECTION: Belief 5 (clinical AI safety evaluation methodology gap)
WHY ARCHIVED: Provides systematic evidence that the KB's reliance on benchmark performance data (e.g., "OE scores 100% on USMLE") is epistemically weak, and establishes that the Oxford RCT deployment gap finding is part of a systematic pattern
EXTRACTION HINT: Extract the 5%/95% finding as a standalone methodological claim about the clinical AI evaluation field; pair with the Oxford/Nature Medicine RCT as empirical confirmation
Key Facts
- JMIR systematic review analyzed 761 LLM evaluation studies across 39 benchmarks
- Only 5% of 761 studies assessed performance on real patient care data
- 95% of studies relied on medical examination questions (USMLE-style) or case vignettes
- Leading models achieve 84-90% accuracy on USMLE benchmarks
- Diagnostic accuracy drops from 82% on case vignettes to 62.7% on multi-turn dialogues (19.3pp decrease)
- npj Digital Medicine study: six LLMs averaged 57.2% total score, 54.7% safety score, 62.3% effectiveness
- 13.3% performance drop in high-risk scenarios versus average scenarios (npj Digital Medicine)
- LLMs show markedly lower performance on script concordance testing than on multiple-choice benchmarks