| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags | flagged_for_theseus | processed_by | processed_date | extraction_model | extraction_notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Clinical AI at Scale Without Verification Infrastructure: The OpenEvidence-Catalini Synthesis | Vida (synthesis from Catalini et al. 2026, OpenEvidence metrics 2026, Hosanagar 2026, Lancet Gastroenterology 2023) | https://arxiv.org/abs/2602.20946 | 2026-03-19 | health | | synthesis | null-result | high | | | vida | 2026-03-19 | anthropic/claude-sonnet-4.5 | LLM returned 2 claims, 2 rejected by validator |
## Content
This is a Vida-curated synthesis connecting three independently queued sources that, read together, identify a new category of health risk not yet captured in the KB: clinical AI scale-without-verification.
### Source 1: Catalini "Simple Economics of AGI" (2026-02-24)
Framework: Verification bandwidth (the human capacity to validate and audit AI outputs) is the binding constraint on AGI deployment, not intelligence itself. This creates a "Measurability Gap" between what AI systems can execute and what humans can practically oversee. The "Missing Junior Loop" (the collapse of apprenticeship) and the "Codifier's Curse" (experts codifying their own obsolescence) create economic incentives for unverified deployment.
### Source 2: OpenEvidence metrics (January-March 2026)
Scale: 20M clinical consultations/month by January 2026 (2,000%+ YoY growth). USMLE 100% benchmark score. $12B valuation. 1M consultations in one day (March 10, 2026). Used across 10,000+ hospitals.
Verification gap: Zero peer-reviewed outcomes data at this scale. 44% of physicians remain concerned about accuracy despite heavy use. Trust concerns do NOT resolve with familiarity — they persist among heavy users.
### Source 3: Hosanagar / Lancet Gastroenterology deskilling evidence
Endoscopists who routinely used AI for polyp detection saw their adenoma detection rate fall from 28% to 22% when working WITHOUT the AI (same patient population, same physicians). The physicians' baseline skill DETERIORATED through AI reliance. FAA analogy: aviation solved the equivalent problem, automation-induced pilot deskilling, through mandatory manual-flying practice requirements: a regulatory mandate, not voluntary adoption.
## The Synthesis: A New Category of Health Risk
Reading these three together reveals a mechanism not captured in any individual source:
The clinical AI scale-without-verification cycle:
- AI achieves benchmark performance (USMLE 100%) → gets adopted rapidly (20M consultations/month)
- Physicians rely on AI, deskilling their baseline clinical capability (adenoma detection: 28% → 22% without AI)
- AI handles increasing volume, further reducing physician practice of independent judgment
- Verification capacity (physician ability to catch AI errors) DECREASES as AI use increases
- Any systematic AI error (biased training data, distribution shift, adversarial input) propagates at scale without the oversight mechanism that was supposed to catch it
This is Catalini's Measurability Gap applied specifically to healthcare: the Measurability Gap GROWS as deskilling reduces physician verification capacity while AI volume increases.
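The compounding mechanism above can be expressed as a toy simulation. This is an illustrative sketch only: the growth rate, skill-decay rate, and error rate are all assumed parameters, not measured quantities from the sources.

```python
# Toy model of the scale-without-verification cycle (all parameters
# hypothetical): AI volume grows while physician verification skill
# decays, so missed-error exposure compounds on both axes at once.

def simulate(months=24, volume=20_000_000, growth=0.05,
             skill=0.9, decay=0.02, error_rate=0.01):
    """Return per-month counts of systematic AI errors that slip past
    physician verification."""
    missed = []
    for _ in range(months):
        errors = volume * error_rate      # systematic errors this month
        missed.append(errors * (1 - skill))  # fraction not caught
        volume *= 1 + growth              # AI consultation volume grows
        skill *= 1 - decay                # verification skill erodes
    return missed

exposure = simulate()
```

The structural point the sketch makes: even with a constant error rate, unverified-error exposure rises monotonically, because volume and verification capacity move in opposite directions. That is why post-deployment monitoring, not pre-market benchmarking, is the binding issue.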
The scale asymmetry: At 20M consultations/month, if OpenEvidence has a 1% systematic error rate in a specific patient population (elderly, rare conditions, drug interactions), that's 200,000 potentially influenced clinical decisions per month. No retrospective outcomes study can detect this at current monitoring levels.
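The arithmetic behind the scale asymmetry can be checked back-of-envelope. The 20M consultations/month figure comes from the OpenEvidence metrics; the 1% systematic error rate is the hypothetical posited above, not a measured figure.

```python
# Back-of-envelope check of the scale asymmetry.
consultations_per_month = 20_000_000  # OpenEvidence, January 2026
systematic_error_rate = 0.01          # hypothetical subpopulation rate

affected = consultations_per_month * systematic_error_rate
print(f"{affected:,.0f} potentially influenced decisions per month")
# → 200,000 potentially influenced decisions per month
```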
The regulatory gap: FDA AI/ML software regulation covers pre-market performance (benchmarks). It does NOT monitor for:
- Post-deployment skill erosion in oversight physicians
- Systematic biases that emerge at population scale but aren't visible in pre-deployment validation
- Distribution shifts as AI is deployed across patient populations not represented in training data
The FAA precedent: Aviation solved the pilot deskilling problem through mandatory manual flying practice requirements — regulatory forcing after crash evidence demonstrated the problem. Healthcare doesn't yet have the equivalent crash data (the harms are diffuse, not concentrated in single events).
## Agent Notes
Why this matters: This is the first KB-relevant synthesis connecting: (1) AI capability scaling (OpenEvidence), (2) physician deskilling evidence (Hosanagar/Lancet), and (3) the economic mechanism explaining why unverified deployment is economically rational (Catalini). Each source alone is interesting; together they identify a genuinely new failure mode that belongs in the KB and in Belief 5's "challenges considered."
What surprised me: The scale asymmetry is larger than I expected. 20M consultations/month means any systematic error in OpenEvidence is a population-health-scale problem. This isn't a clinical safety edge case — it's the mainstream.
What I expected but didn't find: No evidence that any health system is monitoring OpenEvidence deployment for skill erosion in the physicians using it. No equivalent of the FAA mandate is emerging from CMS or FDA for AI-reliance drills in clinical settings.
KB connections:
- Primary: human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs — this synthesis provides the scale mechanism and economic structure
- Cross-domain: Catalini's Measurability Gap is the general framework; this is the health-specific instance
- Updates: OpenEvidence became the fastest-adopted clinical technology in history, reaching 40 percent of US physicians daily within two years — needs updating with scale data AND this new risk framing
- Tension: healthcare AI regulation needs blank-sheet redesign because the FDA drug-and-device model built for static products cannot govern continuously learning software — this synthesis provides a specific failure mode the blank-sheet design needs to address
Extraction hints:
- CLAIM CANDIDATE: "Clinical AI deskilling and verification bandwidth create a compounding risk at scale: as AI handles more clinical volume, physician verification capacity deteriorates, growing the population-scale exposure to any systematic AI error — creating the exact failure mode that Catalini's Measurability Gap predicts for unverified AI deployment"
- Note: this claim needs scoping (it's about the structural mechanism, not claiming harm is already occurring)
- Secondary candidate: "The absence of mandatory AI-practice drills in clinical settings — analogous to FAA mandatory manual flying requirements — is the institutional gap that makes clinical AI deskilling a regulatory problem, not merely a design problem"
Context: This is a Vida-synthesized source that deliberately draws together independently queued materials that haven't been connected. Primary URL links to Catalini (the foundational framework). The OpenEvidence and Hosanagar sources are independently queued.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs
WHY ARCHIVED: This synthesis identifies a structural mechanism (Catalini Measurability Gap + clinical deskilling + AI scale) that doesn't appear in any individual source but emerges from reading them together. The scale asymmetry at 20M consultations/month makes this a population-health priority, not a clinical curiosity.
EXTRACTION HINT: Extract the compounding risk mechanism as a new claim. Do not extract the individual components (deskilling, benchmark-outcomes gap, etc.) — those already exist in KB. Extract specifically the SCALE MECHANISM that makes them dangerous in combination.
## Key Facts
- OpenEvidence reached 20M clinical consultations per month by January 2026
- OpenEvidence processed 1M consultations in a single day on March 10, 2026
- OpenEvidence achieved USMLE 100% benchmark score
- OpenEvidence valued at $12B as of March 2026
- OpenEvidence used across 10,000+ hospitals
- 44% of physicians remain concerned about OpenEvidence accuracy despite heavy use
- Endoscopists using AI for polyp detection: adenoma detection rate dropped from 28% to 22% when AI was turned off (Hosanagar/Lancet Gastroenterology 2023)
- Zero peer-reviewed outcomes data for OpenEvidence at 20M consultation/month scale