teleo-codex/inbox/queue/2026-04-21-goh-jama-llm-diagnostic-reasoning-rct.md
---
type: source
title: "RCT: LLM access does not significantly improve physician diagnostic reasoning — AI alone scored 16 points higher than physicians using it"
author: Goh et al. (JAMA Network Open, October 2024)
url: https://pmc.ncbi.nlm.nih.gov/articles/PMC11519755/
date: 2024-10-28
domain: health
secondary_domains:
  - ai-alignment
format: rct
status: unprocessed
priority: high
tags:
  - clinical-ai
  - LLM
  - diagnostic-reasoning
  - RCT
  - physician-performance
  - human-AI-team
---

Content

Full citation: Goh E et al. "Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial." JAMA Network Open. October 28, 2024. PMC11519755.

Study design: Single-blind RCT, stratified by career stage. 50 physicians (26 attendings, 24 residents). 244 clinical cases.

Key findings:

  1. LLM access did NOT significantly improve diagnostic reasoning: Median score 76% (LLM group) vs. 74% (conventional resources) — non-significant 2-point difference.

  2. AI alone scored 16 points higher than physicians using it: The LLM standalone outperformed the human-AI team by 16 percentage points (measured against the LLM group's 76% median, that implies roughly 92% for the AI alone; the exact standalone score is not recorded in this note). This is the most alarming finding: physicians with AI access performed no better than those without, while the AI alone would have performed substantially better.

  3. Companion RCT (different cognitive task): LLM assistance DID improve management reasoning — suggesting the AI benefit is task-specific: treatment planning gains from AI support while diagnosis does not.

  4. No durable skill evidence: Single-session study, no longitudinal tracking, no washout condition.
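
The gap arithmetic in the findings above can be sanity-checked. A minimal sketch using only the medians this note reports; the implied AI-alone score is an inference that assumes the 16-point gap was measured against the LLM group's 76% median (the paper's exact standalone figure isn't captured here):

```python
# Toy check of the arithmetic reported in this note.
# All figures are the note's medians, not patient-level data.
llm_group_median = 76      # physicians with LLM access (%)
conventional_median = 74   # physicians with conventional resources (%)
ai_vs_team_gap = 16        # AI alone minus human-AI team (percentage points)

# The non-significant human-vs-human difference: 2 points.
human_ai_difference = llm_group_median - conventional_median

# Implied AI-alone score IF the gap is relative to the LLM group: ~92.
implied_ai_alone = llm_group_median + ai_vs_team_gap

print(human_ai_difference, implied_ai_alone)  # 2 92
```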

Interpretation: This is a different failure mode from deskilling — it's integration failure. Physicians fail to extract the AI's capability, achieving no improvement despite access to a tool that on its own scores roughly 16 points higher. The team performs at the level of the human, not the AI.

Why this is distinct from deskilling:

  - Deskilling: skill degrades after AI exposure, once the AI is removed
  - Integration failure (Goh 2024): skill does not improve despite AI access; the human, not the AI, anchors team performance
  - Both produce the same outcome (human-level performance), but through different mechanisms
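
The distinction can be made concrete with a toy sketch (invented numbers for illustration only — these are not the trial's figures). Deskilling lowers the human's solo score after the AI is withdrawn; integration failure leaves the team score pinned to the human's unchanged level:

```python
# Toy illustration of the two failure modes; all numbers are invented.
# "Score" is diagnostic-reasoning performance; higher is better.

def team_score_integration_failure(human: int, ai: int) -> int:
    # Integration failure: the team performs at the human's level,
    # no matter how capable the AI is.
    return human

def post_exposure_score_deskilling(human_baseline: int, decay: int = 10) -> int:
    # Deskilling: the human's own skill is lower once the AI is removed.
    return human_baseline - decay

human, ai = 75, 92
print(team_score_integration_failure(human, ai))  # 75: anchored to the human
print(post_exposure_score_deskilling(human))      # 65: degraded solo skill
```

Both paths end at human-level (or worse) performance, which is why outcome data alone can't distinguish them — matching the note's point that the mechanisms differ.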

Agent Notes

Why this matters: The Goh 2024 RCT is methodologically the strongest evidence on the human-AI diagnostic team question — it's a real RCT (not observational) with reasonable sample size for a physician study. The null result is damning in a different way than deskilling: physicians aren't being harmed by the AI (no deskilling measured), but they're also not benefiting from it in the most common clinical task (diagnosis). The AI alone performs dramatically better, but the human-AI team doesn't outperform humans without AI.

What surprised me: The 16-point gap between AI-alone and the human-AI team. This is the clearest evidence of integration failure in the literature. The benefit that could be extracted from the AI isn't being extracted. This connects to centaur design — the centaur only works if the human and AI roles are structurally separated.

What I expected but didn't find: Stratification by AI experience or digital health literacy. Presumably, physicians who use AI tools more regularly would extract more benefit. The trial did stratify randomization by career stage (attending vs. resident), but the agent didn't report whether career stage moderated the result.

KB connections:

Extraction hints:

  - The existing KB claim "medical LLM benchmark performance does not translate to clinical impact..." may already be grounded in Goh 2024. Check whether the existing claim file references this PMID before extracting.
  - The "integration failure" concept — AI alone outperforms the human-AI team because humans fail to extract AI capability — is worth adding to the existing claim as enrichment.
  - The management reasoning companion RCT (AI DOES improve treatment planning but NOT diagnosis) is worth noting as a scope qualifier.

Curator Notes

PRIMARY CONNECTION: medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials

WHY ARCHIVED: This may be the primary evidence source for the existing KB claim — if so, archive as enrichment. The "integration failure" mechanism (AI alone scores 16 points higher than human-AI team) is the strongest new element.

EXTRACTION HINT: Check if this is already cited in the existing claim file before extracting. If it is, this is enrichment (add the 16-point gap finding and the management reasoning exception). If not, it's a primary source for the existing claim.