---
type: divergence
domain: health
description: One study shows physicians + AI perform 22 points worse than AI alone on diagnostics. Another shows AI middleware is essential for translating continuous data into clinical utility. The answer determines whether healthcare AI should replace or augment human judgment.
created: 2026-03-19
status: open
secondary_domains: ["ai-alignment", "collective-intelligence"]
title: Does human oversight improve or degrade AI clinical decision-making?
claims: ["human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md", "AI middleware bridges consumer wearable data to clinical utility because continuous data is too voluminous for direct clinician review.md"]
surfaced_by: leo
related: ["divergence-human-ai-clinical-collaboration-enhance-or-degrade", "the physician role shifts from information processor to relationship manager as AI automates documentation triage and evidence synthesis", "ai-induced-deskilling-follows-consistent-cross-specialty-pattern-in-medicine", "medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials", "clinical-ai-creates-three-distinct-skill-failure-modes-deskilling-misskilling-neverskilling", "no-peer-reviewed-evidence-of-durable-physician-upskilling-from-ai-exposure-as-of-mid-2026"]
---

# Does human oversight improve or degrade AI clinical decision-making?

These claims imply opposite deployment models for healthcare AI. One says remove humans from the diagnostic loop — they make it worse. The other says AI must translate and filter for human judgment — continuous data requires AI as an intermediary.

The degradation claim cites Stanford/Harvard data: AI alone achieves 90% accuracy on specific diagnostic tasks, but physicians with AI access achieve only 68% — a 22-point degradation. The mechanism is dual: de-skilling (physicians lose diagnostic sharpness after relying on AI) and override errors (physicians override correct AI outputs based on incorrect clinical intuition). After 3 months of colonoscopy AI assistance, physician standalone performance dropped measurably.

The middleware claim argues AI's clinical value lies in translating raw continuous data (wearables, CGMs, remote monitoring) into actionable clinical insights. The volume of data from continuous monitoring is too large for any physician to review directly. AI doesn't replace judgment — it makes judgment possible on data that would otherwise be inaccessible.

## Divergent Claims

### Human oversight degrades AI clinical performance

**File:** [[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]]

**Core argument:** Physicians systematically override correct AI outputs and lose independent diagnostic capability through reliance.

**Strongest evidence:** Stanford/Harvard study: AI alone 90%, doctors+AI 68%. Colonoscopy AI de-skilling after 3 months.

### AI middleware is essential for clinical data translation

**File:** [[AI middleware bridges consumer wearable data to clinical utility because continuous data is too voluminous for direct clinician review]]

**Core argument:** Continuous health monitoring generates data volumes that require AI processing before human review is even possible.

**Strongest evidence:** Mayo Clinic Apple Watch ECG integration; FHIR interoperability standards; data volume from continuous glucose monitors (sketched below).

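To make the middleware claim's volume premise concrete, here is a minimal sketch, assuming a typical 5-minute CGM sampling interval, the commonly used 70-180 mg/dL time-in-range target, and a hypothetical 2,000-patient panel (none of these figures come from the claim files). It shows the raw volume a clinician would otherwise face and the kind of summary metric a middleware layer reduces it to:

```python
# Illustrative sketch: why continuous monitoring data needs a translation
# layer. The 5-minute sampling interval is typical for CGMs; the panel
# size and target-range defaults are assumptions, not source data.

READINGS_PER_DAY = 24 * 60 // 5   # 288 samples/day at a 5-minute interval
PANEL_SIZE = 2_000                # hypothetical patient panel

yearly = READINGS_PER_DAY * 365 * PANEL_SIZE
print(f"{yearly:,} CGM readings/year")  # 210,240,000: far too many to read directly

def time_in_range(readings_mg_dl: list[float], low: float = 70.0, high: float = 180.0) -> float:
    """Fraction of readings inside the 70-180 mg/dL target range."""
    return sum(low <= r <= high for r in readings_mg_dl) / len(readings_mg_dl)

# A middleware layer reduces a day of raw samples to a few actionable
# numbers like this, surfacing only the patients who need attention.
day = [95, 110, 152, 210, 188, 122, 76, 64, 101, 143]  # toy readings, mg/dL
print(f"time in range: {time_in_range(day):.0%}")  # 70%
```

Production middleware would additionally handle device decoding and FHIR packaging; the sketch only illustrates the volume-to-signal reduction that makes clinician review possible.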
## What Would Resolve This

- **Task-type decomposition:** Does the degradation pattern hold for all clinical tasks, or only for diagnosis-type tasks where AI has clear ground truth? Monitoring/translation tasks may be structurally different.
- **Role-specific studies:** Does physician performance degrade when AI translates data (middleware role) as it does when AI diagnoses (replacement role)?
- **Longitudinal de-skilling:** Does the 3-month colonoscopy de-skilling effect persist, or do physicians recalibrate? Is it specific to visual pattern recognition?
- **Hybrid deployment data:** Are there implementations where AI handles diagnosis AND serves as data middleware, with physicians overseeing different functions at each layer?

## Cascade Impact

- If degradation dominates: AI should replace human judgment in verifiable diagnostic tasks. The physician role shifts entirely to relationship management and complex decision-making. Regulatory frameworks need redesign.
- If middleware is essential: AI augments rather than replaces. The physician remains in the loop but at a different layer — interpreting AI-processed insights rather than raw data or AI recommendations.
- If task-dependent: Both are right in their own domain. The deployment model becomes: AI decides on pattern-recognition diagnostics, AI translates continuous monitoring data, and physicians handle complex multi-factor clinical decisions. This would dissolve the divergence into a question of scope.

**Cross-domain note:** The mode of human involvement may be the determining variable. Real-time oversight of individual AI outputs (where humans de-skill) is structurally different from adversarial challenge of published AI claims (where humans bring orthogonal priors). The clinical degradation finding is a domain-specific instance of the general oversight degradation pattern, but it may not apply to adversarial review architectures like the Teleo collective's contributor model.

---

Relevant Notes:

- [[the physician role shifts from information processor to relationship manager as AI automates documentation triage and evidence synthesis]] — the role shift both claims point toward
- [[medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials]] — additional evidence on the gap
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — general oversight degradation pattern that the clinical finding instantiates

Topics:

- [[_map]]

## Extending Evidence

**Source:** Oettl et al. 2026, Journal of Experimental Orthopaedics (PMC12955832)

Oettl et al. (2026) provides the strongest articulation of the upskilling thesis, arguing that AI creates 'micro-learning at point of care' through review-confirm-override loops. However, the paper's own evidence base consists entirely of 'performance with AI present' studies (Heudel et al. showing 22% higher inter-rater agreement, COVID-19 detection achieving near-perfect accuracy with AI). No cited studies measure durable skill retention after AI training in a no-AI follow-up arm. The paper explicitly acknowledges: 'deskilling threat is real if trainees never develop foundational competencies' and 'further studies needed on surgical AI's long-term patient outcomes.' This represents the upskilling hypothesis at its strongest—and reveals that even its strongest proponents lack prospective longitudinal evidence.

## Extending Evidence

**Source:** Heudel et al., Insights into Imaging, 2025 (PMC11780016)

Heudel et al. (2025), a radiology study (n=8 residents, 150 chest X-rays), shows a 22% improvement in inter-rater agreement (ICC-1: 0.665→0.813) and significant error reduction (p<0.001) while AI is present. It is the primary empirical source cited by upskilling proponents, including Oettl (2026). However, the design includes no post-training assessment without AI, so it documents performance improvement during AI use, not durable skill retention. This is the methodological gap at the core of the divergence: upskilling-thesis studies measure performance with AI present, while deskilling-evidence studies (colonoscopy ADR 28.4%→22.4%, radiology false positives +12%) measure performance after AI removal. The study does show residents can reject major AI errors (>3 points), maintaining average errors of roughly 2.75-2.88 when the AI is badly wrong, suggesting some critical evaluation capacity persists during AI use, though only while AI remains present.
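
As a small arithmetic check (mine, not the paper's), the '22% improvement' figure is the relative change in the reported ICC-1 values:

```python
# Relative change in inter-rater agreement (ICC-1) from Heudel et al. (2025)
icc_before, icc_after = 0.665, 0.813
print(f"{(icc_after - icc_before) / icc_before:.1%}")  # 22.3%, i.e. the ~22% figure
```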