
---
description: Stanford-Harvard study shows AI alone 90 percent vs doctors plus AI 68 percent vs doctors alone 65 percent, and a colonoscopy study found experienced gastroenterologists measurably de-skilled after just three months with AI assistance
type: claim
domain: health
created: 2026-02-18
source: DJ Patil interviewing Bob Wachter, Commonwealth Club, February 9 2026; Stanford/Harvard diagnostic accuracy study; European colonoscopy AI de-skilling study
confidence: likely
---

human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs

The human-in-the-loop model -- where AI suggests and humans verify -- is the default safety architecture for clinical AI. But two lines of evidence suggest this model is fundamentally flawed rather than merely imperfect.

The override problem. A Stanford/Harvard study tested physicians diagnosing complex clinical scenarios: doctors alone achieved 65% accuracy, doctors with AI access achieved 68%, and AI alone achieved 90%. The physician's input actually degraded the AI's performance by 22 percentage points. When physicians override correct AI outputs based on intuition or incomplete reasoning, they introduce systematic errors that negate the tool's accuracy advantage. As Bob Wachter's wife put it: "You thought you were smarter than Google Maps."

The de-skilling problem. A European study gave gastroenterologists access to an AI colonoscopy tool that highlights suspicious lesions with green boxes. After just three months of use, the gastroenterologists' unaided detection performance was measurably worse than before they started using the tool. These were not trainees -- the average participant had ten years of experience performing the procedure. Three months of AI assistance eroded a decade of skill.

These findings create a genuine paradox for clinical AI deployment. The system designed for safety -- human oversight of AI -- may be less safe than autonomous AI operation. But autonomous AI in medicine is politically and ethically untenable given current error rates and the stakes involved. The resolution may require rethinking the interaction model entirely: rather than humans verifying AI outputs, perhaps AI should verify human outputs, or the two should process independently with disagreements flagged for deeper review.
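The independent-processing variant can be sketched as a small reconciliation protocol. This is a hypothetical illustration, not from any cited system: the `Read`, `Verdict`, and `reconcile` names are invented. The key property is that neither reader sees the other's output, so the physician cannot override a correct AI result (and vice versa); disagreement is escalated rather than silently resolved.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"      # both independent reads agree; result stands
    ESCALATE = "escalate"  # reads disagree; route to deeper review


@dataclass(frozen=True)
class Read:
    reader: str  # e.g. "human" or "ai"
    label: str   # diagnosis label produced by that reader


def reconcile(human: Read, ai: Read) -> Verdict:
    """Dual independent read: neither reader sees the other's output,
    so neither can override the other. Agreement is accepted; any
    disagreement is flagged for deeper review, never silently resolved."""
    return Verdict.ACCEPT if human.label == ai.label else Verdict.ESCALATE


# Agreement passes through; conflict is escalated, never overridden.
print(reconcile(Read("human", "pneumonia"), Read("ai", "pneumonia")).value)
print(reconcile(Read("human", "pneumonia"), Read("ai", "pulmonary embolism")).value)
```

The design deliberately has no override path: the only outcomes are agreement or escalation, which sidesteps both failure modes above (intuition-driven overrides and vigilance decay from passive verification).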

Wachter frames the challenge directly: "Humans suck at remaining vigilant over time in the face of an AI tool." The Tesla parallel is apt -- a system marketed as "self-driving" that still requires constant human attention has produced 100+ fatalities from the predictable failure of that attention. Healthcare's "physician-in-the-loop" model faces the same fundamental human factors constraint.


Relevant Notes:

Topics:

- health and wellness