teleo-codex/domains/health/clinical-ai-hallucination-rates-vary-100x-by-task-making-single-regulatory-thresholds-operationally-inadequate.md
m3taversal be8ff41bfe link: bidirectional source↔claim index — 414 claims + 252 sources connected
Wrote sourced_from: into 414 claim files pointing back to their origin source.
Backfilled claims_extracted: into 252 source files that were processed but
missing this field. Matching uses author+title overlap against claim source:
field, validated against 296 known-good pairs from existing claims_extracted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 11:55:18 +01:00


---
type: claim
domain: health
description: Hallucination rates range from 1.47% for structured transcription to 64.1% for open-ended summarization, demonstrating that task-specific benchmarking is required
confidence: experimental
source: npj Digital Medicine 2025, empirical testing across multiple clinical AI tasks
created: 2026-04-03
title: Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate
agent: vida
scope: structural
sourcer: npj Digital Medicine
related_claims:
  - AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate, unambiguous, and low-risk
  - healthcare AI regulation needs blank-sheet redesign because the FDA drug-and-device model built for static products cannot govern continuously learning software
  - No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks
supports:
  - clinical-ai-errors-are-76-percent-omissions-not-commissions-inverting-the-hallucination-safety-model
reweave_edges:
  - No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks|supports|2026-04-04
  - clinical-ai-errors-are-76-percent-omissions-not-commissions-inverting-the-hallucination-safety-model|supports|2026-04-07
sourced_from: inbox/archive/health/2026-03-22-cognitive-bias-clinical-llm-npj-digital-medicine.md
---
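The reweave_edges entries use a pipe-delimited `target|edge_type|date` layout. A minimal parsing sketch, assuming that three-field form is the whole schema; the function name and return shape here are hypothetical, not the repo's actual tooling:

```python
from datetime import date


def parse_reweave_edge(line: str) -> tuple[str, str, date]:
    """Split a 'target|edge_type|YYYY-MM-DD' edge into its parts.

    rsplit from the right keeps the parse correct even if a claim
    title were ever to contain a '|' of its own.
    """
    target, edge_type, stamp = line.rsplit("|", 2)
    return target, edge_type, date.fromisoformat(stamp)


edge = parse_reweave_edge(
    "clinical-ai-errors-are-76-percent-omissions-not-commissions"
    "-inverting-the-hallucination-safety-model|supports|2026-04-07"
)
```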

Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate

Empirical testing reveals that clinical AI hallucination rates span a 100x range depending on task structure: ambient scribes performing structured transcription achieve a 1.47% hallucination rate, while clinical case summarization without mitigation reaches 64.1%. GPT-4o with structured mitigation drops from 53% to 23%, and GPT-5 with thinking mode achieves 1.6% on HealthBench.

The variation has a structural cause: constrained tasks such as transcription have clear ground truth and a limited generation space, while open-ended tasks such as summarization and clinical reasoning require synthesis across ambiguous information with no single correct output.

This 100x range makes any single regulatory threshold, such as "all clinical AI must have a <5% hallucination rate", operationally inadequate. Wherever the line is drawn, it fails on one side or the other: a threshold lenient enough to accommodate open-ended tasks permits dangerous applications (64.1% summarization), while one strict enough to match what structured tasks demonstrably achieve prohibits safe ones (1.47% transcription). Task-specific benchmarking is the only viable regulatory approach, yet no framework currently requires it.
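The asymmetry can be sketched numerically. This is an illustration only, using the rates cited in this claim; the 1% and 30% thresholds are hypothetical values chosen to show the two failure modes, not figures from the source:

```python
# Hallucination rates cited in the claim text (fractions, not percents).
rates = {
    "transcription": 0.0147,          # ambient scribe, structured task
    "summarization_raw": 0.641,       # open-ended, no mitigation
    "summarization_mitigated": 0.23,  # GPT-4o with structured mitigation
}

def permitted(threshold: float) -> dict[str, bool]:
    """Which tasks a single regulatory threshold would allow."""
    return {task: rate < threshold for task, rate in rates.items()}

# A strict single bar prohibits even the safe structured task ...
strict = permitted(0.01)    # transcription (1.47%) fails
# ... while a lenient single bar permits a risky open-ended one.
lenient = permitted(0.30)   # mitigated summarization (23%) passes
```

Either hypothetical bar misclassifies at least one task, which is the claim's point: both the acceptable and the achievable rate vary by task, so the bar has to be task-specific.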