extract: 2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
Teleo Agents 2026-03-24 04:34:06 +00:00
parent 059714cab1
commit 55930169c6
4 changed files with 52 additions and 1 deletions

@ -53,6 +53,12 @@ NCT07328815 tests whether a UI-layer behavioral nudge (ensemble-LLM confidence s
RCT evidence (NCT06963957, medRxiv August 2025) shows automation bias persists even after 20 hours of AI-literacy training specifically designed to teach critical evaluation of AI output. Physicians with this training still voluntarily deferred to deliberately erroneous LLM recommendations in 3 of 6 clinical vignettes, demonstrating that the human-in-the-loop degradation mechanism operates even when humans are extensively trained to resist it.
### Additional Evidence (extend)
*Source: [[2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct]] | Added: 2026-03-24*
Oxford RCT 2026 documents a complementary failure mode: while automation bias causes physicians to defer to wrong AI, the deployment gap shows users fail to extract correct guidance from right AI. Both erase clinical value but through opposite mechanisms—one from over-reliance, one from under-extraction. The deployment gap produced zero improvement over control (not degradation), distinguishing it from automation bias which actively worsens outcomes.

@ -40,6 +40,12 @@ ARISE report identifies specific failure modes: real-world performance 'breaks d
JMIR systematic review of 761 studies provides methodological foundation: 95% of clinical LLM evaluation uses medical exam questions rather than real patient data, with only 5% assessing performance on actual patient care. Traditional benchmarks show saturation at 84-90% USMLE accuracy, but conversational frameworks reveal 19.3pp accuracy drop (82% → 62.7%) when moving from case vignettes to multi-turn dialogues. Review concludes: 'substantial disconnects from clinical reality and foundational gaps in construct validity, data integrity, and safety coverage.' This establishes that the Oxford/Nature Medicine RCT deployment gap (94.9% → 34.5%) is part of a systematic field-wide pattern, not an isolated finding.
### Additional Evidence (extend)
*Source: [[2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct]] | Added: 2026-03-24*
Oxford Nature Medicine 2026 RCT (n=1,298) extends the benchmark-to-clinical-impact gap to public users: LLMs achieved 94.9% condition identification in isolation but users assisted by LLMs performed no better than control groups (<34.5%). The 60-point deployment gap held across GPT-4o, Llama 3, and Command R+, indicating the interaction mode, not the model, explains the failure. Root cause identified as 'two-way communication breakdown' where users couldn't extract correct guidance even when AI possessed the right answer.
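The gap figure follows directly from the reported headline numbers. A minimal Python sketch, using the percentages from the study summary (variable names are illustrative; the user-assisted values are reported as upper bounds, so the gaps are lower bounds):

```python
# Solo-LLM benchmark accuracy vs. user-assisted outcomes (percent),
# as reported for the Oxford/Nature Medicine RCT.
solo = {"condition_identification": 94.9, "appropriate_disposition": 56.3}
assisted = {"condition_identification": 34.5, "appropriate_disposition": 44.2}  # "<" bounds

# Deployment gap per metric, in percentage points (at least this large,
# since assisted figures are upper bounds).
gaps = {metric: round(solo[metric] - assisted[metric], 1) for metric in solo}
print(gaps)  # condition identification gap: 60.4 pp; disposition gap: 12.1 pp
```

Note the ~60-point gap appears only on condition identification; the disposition gap is smaller, consistent with the claim that users fail to extract the diagnosis the model already has.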

@ -0,0 +1,24 @@
{
"rejected_claims": [
{
"filename": "llm-medical-advice-shows-60-point-deployment-gap-between-benchmark-performance-and-user-assisted-outcomes.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 1,
"kept": 0,
"fixed": 1,
"rejected": 1,
"fixes_applied": [
"llm-medical-advice-shows-60-point-deployment-gap-between-benchmark-performance-and-user-assisted-outcomes.md:set_created:2026-03-24"
],
"rejections": [
"llm-medical-advice-shows-60-point-deployment-gap-between-benchmark-performance-and-user-assisted-outcomes.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-24"
}

@ -7,10 +7,14 @@ date: 2026-02-10
domain: health
secondary_domains: [ai-alignment]
format: research-paper
status: unprocessed
status: enrichment
priority: high
tags: [clinical-ai-safety, llm-medical-advice, real-world-deployment, benchmark-performance-gap, automation-bias, public-health-ai, belief-5, oxford]
flagged_for_theseus: ["Real-world deployment gap between LLM benchmark performance and user interaction outcomes — AI safety/alignment implication beyond healthcare"]
processed_by: vida
processed_date: 2026-03-24
enrichments_applied: ["medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md", "human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@ -57,3 +61,14 @@ Press coverage: University of Oxford newsroom (Feb 10), The Register ("AI chatbo
PRIMARY CONNECTION: Belief 5 "clinical AI augments but creates novel safety risks requiring centaur design" — fifth failure mode documented
WHY ARCHIVED: Establishes the real-world deployment gap as distinct from automation bias; challenges the assumption that high benchmark performance predicts improved clinical outcomes
EXTRACTION HINT: Extract as standalone claim — distinguish from automation bias (different mechanism: there, physician defers to wrong AI; here, user fails to extract correct guidance from right AI)
## Key Facts
- Oxford Internet Institute and Nuffield Department of Primary Care published RCT in Nature Medicine, February 2026, Vol. 32, pp. 609–615
- Study enrolled 1,298 participants across 10 medical scenarios
- LLMs tested: GPT-4o, Llama 3, Command R+
- LLM solo performance: 94.9% condition identification, 56.3% appropriate disposition
- User-assisted performance: <34.5% condition identification, <44.2% appropriate disposition
- Control group (traditional methods) performed comparably to LLM-assisted group
- Study was preregistered and randomized
- MLCommons co-sponsored the research