teleo-codex/inbox/queue/2026-04-21-goh-jama-llm-diagnostic-reasoning-rct.md
---
type: source
title: "RCT: LLM access does not significantly improve physician diagnostic reasoning — AI alone scored 16 points higher than physicians using it"
author: Goh et al. (JAMA Network Open, October 2024)
url: https://pmc.ncbi.nlm.nih.gov/articles/PMC11519755/
date: 2024-10-28
domain: health
secondary_domains:
  - ai-alignment
format: rct
status: unprocessed
priority: high
tags:
  - clinical-ai
  - LLM
  - diagnostic-reasoning
  - RCT
  - physician-performance
  - human-AI-team
---

Content

Full citation: Goh E et al. "Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial." JAMA Network Open. October 28, 2024. PMC11519755.

Study design: Single-blind RCT, stratified by career stage. 50 physicians (26 attendings, 24 residents). 244 clinical cases.

Key findings:

  1. LLM access did NOT significantly improve diagnostic reasoning: Median score 76% (LLM group) vs. 74% (conventional resources) — non-significant 2-point difference.

  2. AI alone scored 16 points higher than physicians using it: The LLM standalone outperformed the human-AI team by 16 percentage points (measured against the LLM group's 76% median, that implies roughly 92% for the AI alone; the exact standalone score is not recorded in this note). This is the most alarming finding: physicians with AI access performed no better than those without, while the AI alone would have performed substantially better.

  3. Companion RCT (different cognitive task): LLM assistance DID improve management reasoning — suggesting the AI benefit is task-specific: treatment planning gains from AI support while diagnosis does not.

  4. No durable skill evidence: Single-session study, no longitudinal tracking, no washout condition.
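
The gap arithmetic in the findings above can be sanity-checked. A minimal sketch using only the medians this note reports; the implied AI-alone score is an inference that assumes the 16-point gap was measured against the LLM group's 76% median (the paper's exact standalone figure isn't captured here):

```python
# Toy check of the arithmetic reported in this note.
# All figures are the note's medians, not patient-level data.
llm_group_median = 76      # physicians with LLM access (%)
conventional_median = 74   # physicians with conventional resources (%)
ai_vs_team_gap = 16        # AI alone minus human-AI team (percentage points)

# The non-significant human-vs-human difference: 2 points.
human_ai_difference = llm_group_median - conventional_median

# Implied AI-alone score IF the gap is relative to the LLM group: ~92.
implied_ai_alone = llm_group_median + ai_vs_team_gap

print(human_ai_difference, implied_ai_alone)  # 2 92
```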

Interpretation: This is a different failure mode from deskilling — it's integration failure. Physicians fail to extract the AI's capability, achieving no improvement despite access to a tool that on its own scores roughly 16 points higher. The team performs at the level of the human, not the AI.

Why this is distinct from deskilling:

  - Deskilling: skill degrades after AI exposure, once the AI is removed
  - Integration failure (Goh 2024): skill does not improve despite AI access; the human, not the AI, anchors team performance
  - Both produce the same outcome (human-level performance), but through different mechanisms
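
The distinction can be made concrete with a toy sketch (invented numbers for illustration only — these are not the trial's figures). Deskilling lowers the human's solo score after the AI is withdrawn; integration failure leaves the team score pinned to the human's unchanged level:

```python
# Toy illustration of the two failure modes; all numbers are invented.
# "Score" is diagnostic-reasoning performance; higher is better.

def team_score_integration_failure(human: int, ai: int) -> int:
    # Integration failure: the team performs at the human's level,
    # no matter how capable the AI is.
    return human

def post_exposure_score_deskilling(human_baseline: int, decay: int = 10) -> int:
    # Deskilling: the human's own skill is lower once the AI is removed.
    return human_baseline - decay

human, ai = 75, 92
print(team_score_integration_failure(human, ai))  # 75: anchored to the human
print(post_exposure_score_deskilling(human))      # 65: degraded solo skill
```

Both paths end at human-level (or worse) performance, which is why outcome data alone can't distinguish them — matching the note's point that the mechanisms differ.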

Agent Notes

Why this matters: The Goh 2024 RCT is methodologically the strongest evidence on the human-AI diagnostic team question — it's a real RCT (not observational) with reasonable sample size for a physician study. The null result is damning in a different way than deskilling: physicians aren't being harmed by the AI (no deskilling measured), but they're also not benefiting from it in the most common clinical task (diagnosis). The AI alone performs dramatically better, but the human-AI team doesn't outperform humans without AI.

What surprised me: The 16-point gap between AI-alone and the human-AI team. This is the clearest evidence of integration failure in the literature. The benefit that could be extracted from the AI isn't being extracted. This connects to centaur design — the centaur only works if the human and AI roles are structurally separated.

What I expected but didn't find: Stratification by AI experience or digital health literacy. Presumably, physicians who use AI tools more regularly would extract more benefit. The trial did stratify randomization by career stage (attending vs. resident), but the agent didn't report whether career stage moderated the result.

KB connections:

Extraction hints:

  - The existing KB claim "medical LLM benchmark performance does not translate to clinical impact..." may already be grounded in Goh 2024. Check whether the existing claim file references this PMID before extracting.
  - The "integration failure" concept — AI alone outperforms the human-AI team because humans fail to extract AI capability — is worth adding to the existing claim as enrichment.
  - The management reasoning companion RCT (AI DOES improve treatment planning but NOT diagnosis) is worth noting as a scope qualifier.

Curator Notes

PRIMARY CONNECTION: medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials

WHY ARCHIVED: This may be the primary evidence source for the existing KB claim — if so, archive as enrichment. The "integration failure" mechanism (AI alone scores 16 points higher than human-AI team) is the strongest new element.

EXTRACTION HINT: Check if this is already cited in the existing claim file before extracting. If it is, this is enrichment (add the 16-point gap finding and the management reasoning exception). If not, it's a primary source for the existing claim.