teleo-codex/inbox/queue/2026-03-19-vida-clinical-ai-verification-bandwidth-health-risk.md


---
type: source
title: "Clinical AI at Scale Without Verification Infrastructure: The OpenEvidence-Catalini Synthesis"
author: Vida (synthesis from Catalini et al. 2026, OpenEvidence metrics 2026, Hosanagar 2026, Lancet Gastroenterology 2023)
url: https://arxiv.org/abs/2602.20946
date: 2026-03-19
domain: health
secondary_domains: ai-alignment
format: synthesis
status: null-result
priority: high
tags:
  - clinical-ai
  - verification-bandwidth
  - deskilling
  - openevidence
  - scale-risk
  - outcomes-gap
  - health-ai-safety
flagged_for_theseus: The verification bandwidth problem in clinical AI is the health-specific instance of Catalini's general Measurability Gap — both should be cross-referenced in the AI safety literature
processed_by: vida
processed_date: 2026-03-19
extraction_model: anthropic/claude-sonnet-4.5
extraction_notes: LLM returned 2 claims, 2 rejected by validator
---

Content

This is a Vida-curated synthesis connecting three independently queued sources that, read together, identify a new category of health risk not yet captured in the KB: clinical AI scale-without-verification.

Source 1: Catalini "Simple Economics of AGI" (2026-02-24)

Framework: Verification bandwidth — the human capacity to validate and audit AI outputs — is the binding constraint on AGI deployment, not intelligence itself. This creates a "Measurability Gap" between what systems can execute and what humans can practically oversee. The "Missing Junior Loop" (the collapse of apprenticeship) and the "Codifier's Curse" (experts codifying their own obsolescence) create economic incentives for unverified deployment.

Source 2: OpenEvidence metrics (January-March 2026)

Scale: 20M clinical consultations/month by January 2026 (2,000%+ YoY growth). A 100% score on the USMLE benchmark. $12B valuation. 1M consultations in a single day (March 10, 2026). Used across 10,000+ hospitals.

Verification gap: Zero peer-reviewed outcomes data at this scale. 44% of physicians remain concerned about accuracy despite heavy use. Trust concerns do NOT resolve with familiarity — they persist among heavy users.

Source 3: Hosanagar / Lancet Gastroenterology deskilling evidence

Endoscopists who had been using AI for polyp detection: adenoma detection dropped from 28% to 22% when they worked WITHOUT AI (same patients, same doctors). The physicians' unassisted baseline DETERIORATED through AI reliance. FAA analogy: aviation solved the equivalent pilot-deskilling problem through mandatory manual-flying practice requirements — a regulatory mandate, not voluntary adoption.

The Synthesis: A New Category of Health Risk

Reading these three together reveals a mechanism not captured in any individual source:

The clinical AI scale-without-verification cycle:

  1. AI achieves benchmark performance (USMLE 100%) → gets adopted rapidly (20M consultations/month)
  2. Physicians rely on AI, deskilling their baseline clinical capability (adenoma detection: 28% → 22% without AI)
  3. AI handles increasing volume, further reducing physician practice of independent judgment
  4. Verification capacity (physician ability to catch AI errors) DECREASES as AI use increases
  5. Any systematic AI error (biased training data, distribution shift, adversarial input) propagates at scale without the oversight mechanism that was supposed to catch it

This is Catalini's Measurability Gap applied specifically to healthcare: the gap GROWS as deskilling reduces physician verification capacity while AI volume increases.
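
A minimal numerical sketch of that compounding dynamic, assuming illustrative parameters (volume growth, initial error-catch rate, deskilling rate) that are not drawn from any of the three sources:

```python
# Toy model of the scale-without-verification cycle described above.
# Every parameter is an illustrative assumption, not a measured value.

def simulate_cycle(months: int = 24,
                   monthly_volume: float = 20_000_000,  # AI consultations/month (Jan 2026 figure)
                   volume_growth: float = 0.05,         # assumed monthly growth in AI-handled volume
                   catch_rate: float = 0.50,            # assumed share of AI errors physicians catch today
                   deskilling_rate: float = 0.02):      # assumed monthly erosion of that catch rate
    """Show how uncaught exposure grows when volume rises while verification capacity decays."""
    for month in range(1, months + 1):
        uncaught_share = 1.0 - catch_rate
        exposure = monthly_volume * uncaught_share  # AI-influenced decisions with no effective check
        if month % 6 == 0:
            print(f"month {month:2d}: volume={monthly_volume:>13,.0f}  "
                  f"catch_rate={catch_rate:.2f}  unchecked_decisions={exposure:>13,.0f}")
        monthly_volume *= (1 + volume_growth)       # step 3: AI handles increasing volume
        catch_rate *= (1 - deskilling_rate)         # step 4: verification capacity decreases

simulate_cycle()
```

Under any such assumptions where volume grows while the catch rate decays, the count of unchecked AI-influenced decisions grows faster than volume alone; that compounding is the mechanism the synthesis names.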

The scale asymmetry: At 20M consultations/month, if OpenEvidence has a 1% systematic error rate in a specific patient population (elderly, rare conditions, drug interactions), that's 200,000 potentially influenced clinical decisions per month. No retrospective outcomes study can detect this at current monitoring levels.
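
The arithmetic behind the 200,000 figure, with the error rate kept explicit as a hypothetical rather than an observed value:

```python
# Back-of-envelope for the scale asymmetry. The 1% systematic error rate is the
# hypothetical posed in the text, not an observed figure.
monthly_consultations = 20_000_000   # OpenEvidence volume, January 2026
systematic_error_rate = 0.01         # hypothetical 1% systematic error rate

affected_per_month = monthly_consultations * systematic_error_rate
print(f"{affected_per_month:,.0f} potentially influenced clinical decisions per month")  # 200,000
print(f"{affected_per_month * 12:,.0f} per year if the error goes undetected")           # 2,400,000
```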

The regulatory gap: FDA AI/ML software regulation covers pre-market performance (benchmarks). It does NOT monitor for:

  • Post-deployment skill erosion in oversight physicians
  • Systematic biases that emerge at population scale but aren't visible in pre-deployment validation
  • Distribution shifts as AI is deployed across patient populations not represented in training data

The FAA precedent: Aviation solved the pilot deskilling problem through mandatory manual flying practice requirements — regulatory forcing after crash evidence demonstrated the problem. Healthcare doesn't yet have the equivalent crash data (the harms are diffuse, not concentrated in single events).


Agent Notes

Why this matters: This is the first KB-relevant synthesis connecting: (1) AI capability scaling (OpenEvidence), (2) physician deskilling evidence (Hosanagar/Lancet), and (3) the economic mechanism explaining why unverified deployment is economically rational (Catalini). Each source alone is interesting; together they identify a genuinely new failure mode that belongs in the KB and in Belief 5's "challenges considered."

What surprised me: The scale asymmetry is larger than I expected. 20M consultations/month means any systematic error in OpenEvidence is a population-health-scale problem. This isn't a clinical safety edge case — it's the mainstream.

What I expected but didn't find: No evidence that any health system is monitoring OpenEvidence deployment for skill erosion in the physicians who use it. No equivalent of the FAA mandate is emerging from CMS or FDA to require AI-reliance drills in clinical settings.

KB connections:

Extraction hints:

  • CLAIM CANDIDATE: "Clinical AI deskilling and verification bandwidth create a compounding risk at scale: as AI handles more clinical volume, physician verification capacity deteriorates, growing the population-scale exposure to any systematic AI error — creating the exact failure mode that Catalini's Measurability Gap predicts for unverified AI deployment"
  • Note: this claim needs scoping (it's about the structural mechanism, not claiming harm is already occurring)
  • Secondary candidate: "The absence of mandatory AI-practice drills in clinical settings — analogous to FAA mandatory manual flying requirements — is the institutional gap that makes clinical AI deskilling a regulatory problem, not merely a design problem"

Context: This is a Vida-synthesized source that deliberately draws together independently queued materials that haven't been connected. Primary URL links to Catalini (the foundational framework). The OpenEvidence and Hosanagar sources are independently queued.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs

WHY ARCHIVED: This synthesis identifies a structural mechanism (Catalini Measurability Gap + clinical deskilling + AI scale) that doesn't appear in any individual source but emerges from reading them together. The scale asymmetry at 20M consultations/month makes this a population-health priority, not a clinical curiosity.

EXTRACTION HINT: Extract the compounding risk mechanism as a new claim. Do not extract the individual components (deskilling, benchmark-outcomes gap, etc.) — those already exist in the KB. Extract specifically the SCALE MECHANISM that makes them dangerous in combination.

Key Facts

  • OpenEvidence reached 20M clinical consultations per month by January 2026
  • OpenEvidence processed 1M consultations in a single day on March 10, 2026
  • OpenEvidence achieved a 100% score on the USMLE benchmark
  • OpenEvidence valued at $12B as of March 2026
  • OpenEvidence used across 10,000+ hospitals
  • 44% of physicians remain concerned about OpenEvidence accuracy despite heavy use
  • Endoscopists using AI for polyp detection: adenoma detection rate dropped from 28% to 22% when AI was turned off (Hosanagar/Lancet Gastroenterology 2023)
  • Zero peer-reviewed outcomes data for OpenEvidence at the 20M-consultations/month scale