---
type: source
title: "Radiology residents show error resilience against large AI mistakes (ICC improvement 0.665→0.813) but no durable up-skilling measured after AI removal — closest counter-evidence to automation bias"
author: "Savardi et al., Insights into Imaging (PMC11780016)"
url: https://pmc.ncbi.nlm.nih.gov/articles/PMC11780016/
date: 2025-01-29
domain: health
secondary_domains: [ai-alignment]
format: journal-article
status: unprocessed
priority: medium
tags: [clinical-ai, deskilling, automation-bias, radiology, error-resilience, medical-education]
---

## Content
**Full citation:** Savardi et al. "Upskilling or deskilling? Measurable role of an AI-supported training for radiology residents: a lesson from the pandemic." Insights into Imaging. January 29, 2025. PMC11780016.

**Study design:** Pilot experimental study with three within-subjects conditions (no-AI, on-demand AI, integrated AI). 8 radiology residents (4 first-year, 4 third-year). 150 chest X-rays.

**Key findings:**

1. AI support significantly reduced scoring errors (p<0.001) — residents performed better WITH AI present
2. Inter-rater agreement improved from ICC 0.665 to 0.813 (a 22% relative gain) with AI — suggesting AI calibrates agreement, not just individual accuracy (worked check after this list)
3. **Error resilience:** Residents were resilient to AI errors ABOVE an acceptability threshold — they didn't blindly follow wrong AI suggestions when errors were large. This is the most important finding for the automation bias debate.
4. Performance improvement was only measured while AI was present — **no washout condition or follow-up measurement without AI**
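
Quick arithmetic check of the "22%" figure above (my own back-of-envelope calculation from the two reported ICC (intraclass correlation coefficient) values, not a number quoted from the paper): the gain is relative, not absolute.

```latex
% Reported inter-rater agreement (Savardi et al.): ICC 0.665 without AI, 0.813 with AI
% Relative gain (the "22%" figure) vs. absolute gain in ICC points:
\[
\frac{\mathrm{ICC}_{\text{AI}} - \mathrm{ICC}_{\text{no AI}}}{\mathrm{ICC}_{\text{no AI}}}
  = \frac{0.813 - 0.665}{0.665} \approx 0.22 \;(\approx 22\%),
\qquad
\mathrm{ICC}_{\text{AI}} - \mathrm{ICC}_{\text{no AI}} = 0.148 .
\]
```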
**Critical limitation:** n=8, single session, no longitudinal tracking. Performance after AI removal was NOT measured — the study cannot show durable up-skilling.

**What this is and isn't:**
- IS: Evidence that radiology residents preserve critical judgment against large AI errors (error resilience, not passive automation bias, for large errors)
- IS NOT: Evidence of durable up-skilling — after AI removal, we don't know whether skill is maintained
- IS NOT: Evidence against the subtle automation bias pattern (following small/borderline AI errors, failing to catch small mistakes)

**Context:** The error resilience finding applies to LARGE errors where residents recognize the AI is clearly wrong. The automation bias literature documents performance degradation from following SMALL, plausible-looking AI errors — which this study didn't test.

## Agent Notes
**Why this matters:** This is the closest counter-evidence to the automation bias/deskilling thesis I found in the 2024-2026 literature. The error-resilience finding suggests critical judgment is at least partially preserved — residents don't blindly follow obviously wrong AI. This is important nuance: the deskilling concern may apply primarily to borderline/subtle errors, not to gross AI errors.

**What surprised me:** The 22% improvement in inter-rater agreement (ICC 0.665 → 0.813) is the most interesting finding. Calibrating inter-rater agreement is a distinct benefit from improving individual accuracy — it suggests AI may standardize radiologist performance even when it doesn't improve individual skill. This has implications for the "AI as quality floor" argument.

**What I expected but didn't find:** A washout condition showing performance after AI removal. This is the definitive test of durable up-skilling. Without it, the study can only speak to concurrent performance.

**KB connections:**
- [[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]] — this study partially complicates the "overriding" claim: residents do resist large errors. The override problem may be worse for subtle/borderline errors.
- [[AI diagnostic triage achieves 97 percent sensitivity across 14 conditions making AI-first screening viable for all imaging and pathology]] — the inter-rater calibration finding suggests AI's value in imaging may extend beyond sensitivity to standardization

**Extraction hints:**
- The error-resilience finding is extractable as a scope qualifier on the automation bias claim: "Automation bias appears strongest for subtle, plausible AI errors; clinicians preserve critical judgment against large, recognizable AI errors"
- Confidence: experimental (n=8, single session, no washout)
- This should be used to SCOPE the existing automation bias claim, not to refute it
## Curator Notes
PRIMARY CONNECTION: [[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]]

WHY ARCHIVED: Closest counter-evidence to automation bias in 2024-2026 literature — error resilience for large AI mistakes complicates the universal automation bias framing. Also: ICC calibration benefit is a distinct AI value not yet in KB.

EXTRACTION HINT: Use as scope qualifier on automation bias claim, not refutation. The key nuance: automation bias appears strongest for subtle errors; clinicians preserve judgment against large errors.