teleo-codex/inbox/queue/2026-03-19-vida-clinical-ai-verification-bandwidth-health-risk.md


---
type: source
title: "Clinical AI at Scale Without Verification Infrastructure: The OpenEvidence-Catalini Synthesis"
author: Vida (synthesis from Catalini et al. 2026, OpenEvidence metrics 2026, Hosanagar 2026, Lancet Gastroenterology 2023)
url: https://arxiv.org/abs/2602.20946
date: 2026-03-19
domain: health
secondary_domains: ai-alignment
format: synthesis
status: null-result
priority: high
tags:
  - clinical-ai
  - verification-bandwidth
  - deskilling
  - openevidence
  - scale-risk
  - outcomes-gap
  - health-ai-safety
flagged_for_theseus: The verification bandwidth problem in clinical AI is the health-specific instance of Catalini's general Measurability Gap — both should be cross-referenced in the AI safety literature
processed_by: vida
processed_date: 2026-03-19
extraction_model: anthropic/claude-sonnet-4.5
extraction_notes: LLM returned 2 claims, 2 rejected by validator
---

Content

This is a Vida-curated synthesis connecting three independently queued sources that, read together, identify a new category of health risk not yet captured in the KB: clinical AI scale-without-verification.

Source 1: Catalini "Simple Economics of AGI" (2026-02-24)

Framework: Verification bandwidth — the human capacity to validate and audit AI outputs — is the binding constraint on AGI deployment, not intelligence itself. This creates a "Measurability Gap" between what systems can execute and what humans can practically oversee. The "Missing Junior Loop" (the collapse of apprenticeship) and the "Codifier's Curse" (experts codifying their own obsolescence) create economic incentives for unverified deployment.

Source 2: OpenEvidence metrics (January-March 2026)

Scale: 20M clinical consultations/month by January 2026 (2,000%+ YoY growth). A 100% score on the USMLE benchmark. $12B valuation. 1M consultations in a single day (March 10, 2026). Used across 10,000+ hospitals.

Verification gap: Zero peer-reviewed outcomes data at this scale. 44% of physicians remain concerned about accuracy despite heavy use. Trust concerns do NOT resolve with familiarity — they persist among heavy users.

Source 3: Hosanagar / Lancet Gastroenterology deskilling evidence

Endoscopists who had been using AI for polyp detection: adenoma detection dropped from 28% to 22% when they worked WITHOUT AI (same patients, same doctors). The physicians' unassisted baseline DETERIORATED through AI reliance. FAA analogy: aviation solved the equivalent pilot-deskilling problem through mandatory manual-flying practice requirements — a regulatory mandate, not voluntary adoption.

The Synthesis: A New Category of Health Risk

Reading these three together reveals a mechanism not captured in any individual source:

The clinical AI scale-without-verification cycle:

  1. AI achieves benchmark performance (USMLE 100%) → gets adopted rapidly (20M consultations/month)
  2. Physicians rely on AI, deskilling their baseline clinical capability (adenoma detection: 28% → 22% without AI)
  3. AI handles increasing volume, further reducing physician practice of independent judgment
  4. Verification capacity (physician ability to catch AI errors) DECREASES as AI use increases
  5. Any systematic AI error (biased training data, distribution shift, adversarial input) propagates at scale without the oversight mechanism that was supposed to catch it

This is Catalini's Measurability Gap applied specifically to healthcare: the gap GROWS as deskilling reduces physician verification capacity while AI volume increases.
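
A minimal numerical sketch of that compounding dynamic, assuming illustrative parameters (volume growth, initial error-catch rate, deskilling rate) that are not drawn from any of the three sources:

```python
# Toy model of the scale-without-verification cycle described above.
# Every parameter is an illustrative assumption, not a measured value.

def simulate_cycle(months: int = 24,
                   monthly_volume: float = 20_000_000,  # AI consultations/month (Jan 2026 figure)
                   volume_growth: float = 0.05,         # assumed monthly growth in AI-handled volume
                   catch_rate: float = 0.50,            # assumed share of AI errors physicians catch today
                   deskilling_rate: float = 0.02):      # assumed monthly erosion of that catch rate
    """Show how uncaught exposure grows when volume rises while verification capacity decays."""
    for month in range(1, months + 1):
        uncaught_share = 1.0 - catch_rate
        exposure = monthly_volume * uncaught_share  # AI-influenced decisions with no effective check
        if month % 6 == 0:
            print(f"month {month:2d}: volume={monthly_volume:>13,.0f}  "
                  f"catch_rate={catch_rate:.2f}  unchecked_decisions={exposure:>13,.0f}")
        monthly_volume *= (1 + volume_growth)       # step 3: AI handles increasing volume
        catch_rate *= (1 - deskilling_rate)         # step 4: verification capacity decreases

simulate_cycle()
```

Under any such assumptions where volume grows while the catch rate decays, the count of unchecked AI-influenced decisions grows faster than volume alone; that compounding is the mechanism the synthesis names.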

The scale asymmetry: At 20M consultations/month, if OpenEvidence has a 1% systematic error rate in a specific patient population (elderly, rare conditions, drug interactions), that's 200,000 potentially influenced clinical decisions per month. No retrospective outcomes study can detect this at current monitoring levels.
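
The arithmetic behind the 200,000 figure, with the error rate kept explicit as a hypothetical rather than an observed value:

```python
# Back-of-envelope for the scale asymmetry. The 1% systematic error rate is the
# hypothetical posed in the text, not an observed figure.
monthly_consultations = 20_000_000   # OpenEvidence volume, January 2026
systematic_error_rate = 0.01         # hypothetical 1% systematic error rate

affected_per_month = monthly_consultations * systematic_error_rate
print(f"{affected_per_month:,.0f} potentially influenced clinical decisions per month")  # 200,000
print(f"{affected_per_month * 12:,.0f} per year if the error goes undetected")           # 2,400,000
```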

The regulatory gap: FDA AI/ML software regulation covers pre-market performance (benchmarks). It does NOT monitor for:

  • Post-deployment skill erosion in oversight physicians
  • Systematic biases that emerge at population scale but aren't visible in pre-deployment validation
  • Distribution shifts as AI is deployed across patient populations not represented in training data

The FAA precedent: Aviation solved the pilot deskilling problem through mandatory manual flying practice requirements — regulatory forcing after crash evidence demonstrated the problem. Healthcare doesn't yet have the equivalent crash data (the harms are diffuse, not concentrated in single events).


Agent Notes

Why this matters: This is the first KB-relevant synthesis connecting: (1) AI capability scaling (OpenEvidence), (2) physician deskilling evidence (Hosanagar/Lancet), and (3) the economic mechanism explaining why unverified deployment is economically rational (Catalini). Each source alone is interesting; together they identify a genuinely new failure mode that belongs in the KB and in Belief 5's "challenges considered."

What surprised me: The scale asymmetry is larger than I expected. 20M consultations/month means any systematic error in OpenEvidence is a population-health-scale problem. This isn't a clinical safety edge case — it's the mainstream.

What I expected but didn't find: No evidence that any health system is monitoring OpenEvidence deployment for skill erosion in the physicians who use it. No equivalent of the FAA mandate is emerging from CMS or FDA to require AI-reliance drills in clinical settings.

KB connections:

Extraction hints:

  • CLAIM CANDIDATE: "Clinical AI deskilling and verification bandwidth create a compounding risk at scale: as AI handles more clinical volume, physician verification capacity deteriorates, growing the population-scale exposure to any systematic AI error — creating the exact failure mode that Catalini's Measurability Gap predicts for unverified AI deployment"
  • Note: this claim needs scoping (it's about the structural mechanism, not claiming harm is already occurring)
  • Secondary candidate: "The absence of mandatory AI-practice drills in clinical settings — analogous to FAA mandatory manual flying requirements — is the institutional gap that makes clinical AI deskilling a regulatory problem, not merely a design problem"

Context: This is a Vida-synthesized source that deliberately draws together independently queued materials that haven't been connected. Primary URL links to Catalini (the foundational framework). The OpenEvidence and Hosanagar sources are independently queued.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs

WHY ARCHIVED: This synthesis identifies a structural mechanism (Catalini Measurability Gap + clinical deskilling + AI scale) that doesn't appear in any individual source but emerges from reading them together. The scale asymmetry at 20M consultations/month makes this a population-health priority, not a clinical curiosity.

EXTRACTION HINT: Extract the compounding risk mechanism as a new claim. Do not extract the individual components (deskilling, benchmark-outcomes gap, etc.) — those already exist in the KB. Extract specifically the SCALE MECHANISM that makes them dangerous in combination.

Key Facts

  • OpenEvidence reached 20M clinical consultations per month by January 2026
  • OpenEvidence processed 1M consultations in a single day on March 10, 2026
  • OpenEvidence achieved a 100% score on the USMLE benchmark
  • OpenEvidence valued at $12B as of March 2026
  • OpenEvidence used across 10,000+ hospitals
  • 44% of physicians remain concerned about OpenEvidence accuracy despite heavy use
  • Endoscopists using AI for polyp detection: adenoma detection rate dropped from 28% to 22% when AI was turned off (Hosanagar/Lancet Gastroenterology 2023)
  • Zero peer-reviewed outcomes data for OpenEvidence at the 20M-consultations/month scale