teleo-codex/domains/health/medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md

---
description: OpenEvidence scored 100 percent on USMLE and GPT-4 outperforms ED residents on structured cases but a multi-hospital RCT showed no diagnostic accuracy improvement with AI access suggesting the value of clinical AI is workflow efficiency not diagnostic augmentation
type: claim
domain: health
created: 2026-02-17
source: OpenEvidence USMLE 100%; GPT-4 vs ED physicians (PMC 2024); UVA/Stanford/Harvard randomized trial (Stanford HAI 2025)
confidence: likely
---

# medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials

Medical LLMs have reached and surpassed human benchmarks: OpenEvidence scored 100% on USMLE, Med-PaLM 2 achieved 86.5% on MedQA, and GPT-4 outperformed ED resident physicians in diagnostic accuracy on structured internal medicine cases. But a UVA/Stanford/Harvard randomized trial found that physicians with and without ChatGPT access achieved similar diagnostic accuracy -- the tool did not meaningfully improve performance even when available. GPT-4 also missed nearly half the diagnoses in a systematic evaluation of radiological cases despite scoring well on structured exams.

This gap between benchmarks and clinical reality has structural explanations. Standardized exams test pattern recognition on complete case presentations. Real clinical encounters involve ambiguous symptoms, incomplete information, and the need to integrate patient context, values, and preferences. The physician's value-add is not information retrieval (where AI excels) but contextual judgment (where AI adds little).

A deeper finding from a Stanford/Harvard study challenges even the "similar accuracy" conclusion: physicians diagnosing complex clinical scenarios alone achieved 65% accuracy, physicians with AI access achieved 68%, but the AI alone achieved 90%. Physician input actively degraded AI performance by 22 percentage points. This suggests the problem is not that AI fails to help physicians -- it is that physicians override correct AI outputs based on intuition, introducing systematic errors: human-in-the-loop clinical AI degrades to worse than AI alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.
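
To make the override dynamic concrete, here is a back-of-envelope sketch. The three accuracies are the figures reported above; the linear mixture model, the `implied_override_rate` helper, and the reading of q as an override rate are illustrative assumptions, not something the study estimated. Under those assumptions, the reported 68% combined accuracy implies physicians effectively substituted their own judgment for the AI's roughly 88% of the time.

```python
# Back-of-envelope mixture model (illustrative assumption, not from the study):
# combined = (1 - q) * acc_ai + q * acc_md, where q is the rate at which
# the physician's judgment effectively displaces the AI's answer.

ACC_MD = 0.65        # physicians alone (reported)
ACC_AI = 0.90        # AI alone (reported)
ACC_COMBINED = 0.68  # physicians with AI access (reported)

def implied_override_rate(acc_md: float, acc_ai: float, acc_combined: float) -> float:
    """Solve the mixture equation for q, the implied displacement rate."""
    return (acc_ai - acc_combined) / (acc_ai - acc_md)

q = implied_override_rate(ACC_MD, ACC_AI, ACC_COMBINED)
print(f"implied override rate: {q:.0%}")  # -> implied override rate: 88%
```

Under this crude model the combined condition behaves almost entirely like the unaided physician, which is exactly what the de-skilling and override claim predicts.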

The implication for AI deployment strategy: the highest-value clinical AI applications are not diagnostic augmentation but workflow automation (ambient documentation, administrative burden reduction) and safety netting (AI triage catching missed findings). The centaur model may still apply to medicine, but the interaction design must prevent physicians from overriding AI on tasks where AI demonstrably outperforms -- a politically and ethically charged constraint.


Relevant Notes:

Topics:

- livingip overview
- health and wellness