---
type: source
title: "Radiology residents show error resilience against large AI mistakes (ICC improvement 0.665→0.813) but no durable up-skilling measured after AI removal — closest counter-evidence to automation bias"
author: "Savardi et al., Insights into Imaging (PMC11780016)"
url: https://pmc.ncbi.nlm.nih.gov/articles/PMC11780016/
date: 2025-01-29
domain: health
secondary_domains: [ai-alignment]
format: journal-article
status: unprocessed
priority: medium
tags: [clinical-ai, deskilling, automation-bias, radiology, error-resilience, medical-education]
---

## Content
**Full citation:** Savardi et al. "Upskilling or deskilling? Measurable role of an AI-supported training for radiology residents: a lesson from the pandemic." Insights into Imaging. January 29, 2025. PMC11780016.

**Study design:** Pilot experimental study with three within-subjects conditions (no-AI, on-demand AI, integrated AI). 8 radiology residents (4 first-year, 4 third-year). 150 chest X-rays.

**Key findings:**

1. AI support significantly reduced scoring errors (p<0.001) — residents performed better WITH AI present
2. Inter-rater agreement improved from ICC 0.665 to 0.813 (a 22% relative gain) with AI — suggesting AI calibrates agreement, not just individual accuracy (worked check after this list)
3. **Error resilience:** Residents were resilient to AI errors ABOVE an acceptability threshold — they didn't blindly follow wrong AI suggestions when errors were large. This is the most important finding for the automation bias debate.
4. Performance improvement was only measured while AI was present — **no washout condition or follow-up measurement without AI**
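
Quick arithmetic check of the "22%" figure above (my own back-of-envelope calculation from the two reported ICC (intraclass correlation coefficient) values, not a number quoted from the paper): the gain is relative, not absolute.

```latex
% Reported inter-rater agreement (Savardi et al.): ICC 0.665 without AI, 0.813 with AI
% Relative gain (the "22%" figure) vs. absolute gain in ICC points:
\[
\frac{\mathrm{ICC}_{\text{AI}} - \mathrm{ICC}_{\text{no AI}}}{\mathrm{ICC}_{\text{no AI}}}
  = \frac{0.813 - 0.665}{0.665} \approx 0.22 \;(\approx 22\%),
\qquad
\mathrm{ICC}_{\text{AI}} - \mathrm{ICC}_{\text{no AI}} = 0.148 .
\]
```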
**Critical limitation:** n=8, single session, no longitudinal tracking. Performance after AI removal was NOT measured — the study cannot show durable up-skilling.

**What this is and isn't:**
- IS: Evidence that radiology residents preserve critical judgment against large AI errors (error resilience, not passive automation bias, for large errors)
- IS NOT: Evidence of durable up-skilling — after AI removal, we don't know whether skill is maintained
- IS NOT: Evidence against the subtle automation bias pattern (following small/borderline AI errors, failing to catch small mistakes)

**Context:** The error resilience finding applies to LARGE errors where residents recognize the AI is clearly wrong. The automation bias literature documents performance degradation from following SMALL, plausible-looking AI errors — which this study didn't test.

## Agent Notes
**Why this matters:** This is the closest counter-evidence to the automation bias/deskilling thesis I found in the 2024-2026 literature. The error-resilience finding suggests critical judgment is at least partially preserved — residents don't blindly follow obviously wrong AI. This is important nuance: the deskilling concern may apply primarily to borderline/subtle errors, not to gross AI errors.

**What surprised me:** The 22% improvement in inter-rater agreement (ICC 0.665 → 0.813) is the most interesting finding. Calibrating inter-rater agreement is a distinct benefit from improving individual accuracy — it suggests AI may standardize radiologist performance even when it doesn't improve individual skill. This has implications for the "AI as quality floor" argument.

**What I expected but didn't find:** A washout condition showing performance after AI removal. This is the definitive test of durable up-skilling. Without it, the study can only speak to concurrent performance.

**KB connections:**
- [[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]] — this study partially complicates the "overriding" claim: residents do resist large errors. The override problem may be worse for subtle/borderline errors.
- [[AI diagnostic triage achieves 97 percent sensitivity across 14 conditions making AI-first screening viable for all imaging and pathology]] — the inter-rater calibration finding suggests AI's value in imaging may extend beyond sensitivity to standardization

**Extraction hints:**
- The error-resilience finding is extractable as a scope qualifier on the automation bias claim: "Automation bias appears strongest for subtle, plausible AI errors; clinicians preserve critical judgment against large, recognizable AI errors"
- Confidence: experimental (n=8, single session, no washout)
- This should be used to SCOPE the existing automation bias claim, not to refute it
## Curator Notes
PRIMARY CONNECTION: [[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]]

WHY ARCHIVED: Closest counter-evidence to automation bias in 2024-2026 literature — error resilience for large AI mistakes complicates the universal automation bias framing. Also: ICC calibration benefit is a distinct AI value not yet in KB.

EXTRACTION HINT: Use as scope qualifier on automation bias claim, not refutation. The key nuance: automation bias appears strongest for subtle errors; clinicians preserve judgment against large errors.