extract: 2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
Teleo Agents 2026-03-24 04:34:06 +00:00
parent 059714cab1
commit 55930169c6
4 changed files with 52 additions and 1 deletions

@ -53,6 +53,12 @@ NCT07328815 tests whether a UI-layer behavioral nudge (ensemble-LLM confidence s
RCT evidence (NCT06963957, medRxiv August 2025) shows automation bias persists even after 20 hours of AI-literacy training specifically designed to teach critical evaluation of AI output. Physicians with this training still voluntarily deferred to deliberately erroneous LLM recommendations in 3 of 6 clinical vignettes, demonstrating that the human-in-the-loop degradation mechanism operates even when humans are extensively trained to resist it.
### Additional Evidence (extend)
*Source: [[2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct]] | Added: 2026-03-24*
Oxford RCT 2026 documents a complementary failure mode: while automation bias causes physicians to defer to wrong AI, the deployment gap shows users fail to extract correct guidance from right AI. Both erase clinical value but through opposite mechanisms—one from over-reliance, one from under-extraction. The deployment gap produced zero improvement over control (not degradation), distinguishing it from automation bias which actively worsens outcomes.

@ -40,6 +40,12 @@ ARISE report identifies specific failure modes: real-world performance 'breaks d
JMIR systematic review of 761 studies provides methodological foundation: 95% of clinical LLM evaluation uses medical exam questions rather than real patient data, with only 5% assessing performance on actual patient care. Traditional benchmarks show saturation at 84-90% USMLE accuracy, but conversational frameworks reveal 19.3pp accuracy drop (82% → 62.7%) when moving from case vignettes to multi-turn dialogues. Review concludes: 'substantial disconnects from clinical reality and foundational gaps in construct validity, data integrity, and safety coverage.' This establishes that the Oxford/Nature Medicine RCT deployment gap (94.9% → 34.5%) is part of a systematic field-wide pattern, not an isolated finding.
### Additional Evidence (extend)
*Source: [[2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct]] | Added: 2026-03-24*
Oxford Nature Medicine 2026 RCT (n=1,298) extends the benchmark-to-clinical-impact gap to public users: LLMs achieved 94.9% condition identification in isolation but users assisted by LLMs performed no better than control groups (<34.5%). The 60-point deployment gap held across GPT-4o, Llama 3, and Command R+, indicating the interaction mode, not the model, explains the failure. Root cause identified as 'two-way communication breakdown' where users couldn't extract correct guidance even when AI possessed the right answer.
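The gap figure follows directly from the reported headline numbers. A minimal Python sketch, using the percentages from the study summary (variable names are illustrative; the user-assisted values are reported as upper bounds, so the gaps are lower bounds):

```python
# Solo-LLM benchmark accuracy vs. user-assisted outcomes (percent),
# as reported for the Oxford/Nature Medicine RCT.
solo = {"condition_identification": 94.9, "appropriate_disposition": 56.3}
assisted = {"condition_identification": 34.5, "appropriate_disposition": 44.2}  # "<" bounds

# Deployment gap per metric, in percentage points (at least this large,
# since assisted figures are upper bounds).
gaps = {metric: round(solo[metric] - assisted[metric], 1) for metric in solo}
print(gaps)  # condition identification gap: 60.4 pp; disposition gap: 12.1 pp
```

Note the ~60-point gap appears only on condition identification; the disposition gap is smaller, consistent with the claim that users fail to extract the diagnosis the model already has.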

@ -0,0 +1,24 @@
{
"rejected_claims": [
{
"filename": "llm-medical-advice-shows-60-point-deployment-gap-between-benchmark-performance-and-user-assisted-outcomes.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 1,
"kept": 0,
"fixed": 1,
"rejected": 1,
"fixes_applied": [
"llm-medical-advice-shows-60-point-deployment-gap-between-benchmark-performance-and-user-assisted-outcomes.md:set_created:2026-03-24"
],
"rejections": [
"llm-medical-advice-shows-60-point-deployment-gap-between-benchmark-performance-and-user-assisted-outcomes.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-24"
}

@ -7,10 +7,14 @@ date: 2026-02-10
domain: health
secondary_domains: [ai-alignment]
format: research-paper
status: unprocessed
status: enrichment
priority: high
tags: [clinical-ai-safety, llm-medical-advice, real-world-deployment, benchmark-performance-gap, automation-bias, public-health-ai, belief-5, oxford]
flagged_for_theseus: ["Real-world deployment gap between LLM benchmark performance and user interaction outcomes — AI safety/alignment implication beyond healthcare"]
processed_by: vida
processed_date: 2026-03-24
enrichments_applied: ["medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md", "human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@ -57,3 +61,14 @@ Press coverage: University of Oxford newsroom (Feb 10), The Register ("AI chatbo
PRIMARY CONNECTION: Belief 5 "clinical AI augments but creates novel safety risks requiring centaur design" — fifth failure mode documented
WHY ARCHIVED: Establishes the real-world deployment gap as distinct from automation bias; challenges the assumption that high benchmark performance predicts improved clinical outcomes
EXTRACTION HINT: Extract as standalone claim — distinguish from automation bias (different mechanism: there, physician defers to wrong AI; here, user fails to extract correct guidance from right AI)
## Key Facts
- Oxford Internet Institute and Nuffield Department of Primary Care published RCT in Nature Medicine, February 2026, Vol. 32, pp. 609–615
- Study enrolled 1,298 participants across 10 medical scenarios
- LLMs tested: GPT-4o, Llama 3, Command R+
- LLM solo performance: 94.9% condition identification, 56.3% appropriate disposition
- User-assisted performance: <34.5% condition identification, <44.2% appropriate disposition
- Control group (traditional methods) performed comparably to LLM-assisted group
- Study was preregistered and randomized
- MLCommons co-sponsored the research