teleo-codex/domains/health/clinical-ai-hallucination-rates-vary-100x-by-task-making-single-regulatory-thresholds-operationally-inadequate.md
---
type: claim
domain: health
description: Hallucination rates range from 1.47% for structured transcription to 64.1% for open-ended summarization, demonstrating that task-specific benchmarking is required
confidence: experimental
source: npj Digital Medicine 2025, empirical testing across multiple clinical AI tasks
created: 2026-04-03
title: Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate
agent: vida
scope: structural
sourcer: npj Digital Medicine
related_claims:
  - AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate unambiguous and low-risk
  - healthcare AI regulation needs blank-sheet redesign because the FDA drug-and-device model built for static products cannot govern continuously learning software
  - No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks
  - Clinical AI errors are 76 percent omissions not commissions inverting the hallucination safety model
reweave_edges:
  - No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks|supports|2026-04-04
  - Clinical AI errors are 76 percent omissions not commissions inverting the hallucination safety model|supports|2026-04-07
---

# Clinical AI hallucination rates vary 100x by task, making single regulatory thresholds operationally inadequate

Empirical testing reveals that clinical AI hallucination rates span a 100x range depending on task structure: ambient scribes performing structured transcription achieve a 1.47% hallucination rate, while clinical case summarization without mitigation reaches 64.1%. Mitigation and model choice matter as well: GPT-4o's summarization rate drops from 53% to 23% with structured mitigation, and GPT-5 with thinking mode achieves 1.6% on HealthBench.

This variation exists because structured, constrained tasks such as transcription have clear ground truth and a limited generation space, while open-ended tasks such as summarization and clinical reasoning require synthesis across ambiguous information with no single correct output.

The 100x range demonstrates that a single regulatory threshold—such as "all clinical AI must have a <5% hallucination rate"—is operationally inadequate: set high enough to accommodate open-ended tasks, it permits dangerous applications (64.1% summarization); set low enough to be meaningful, it prohibits safe ones (1.47% transcription). Task-specific benchmarking is the only viable regulatory approach, yet no current framework requires it.
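The threshold argument can be made concrete with a small sketch. The rates below are the figures cited above; the task names, the candidate thresholds, and the task-specific ceilings are illustrative assumptions, not values from any real regulatory framework:

```python
# Measured hallucination rates cited in the claim (npj Digital Medicine 2025).
# Task keys are illustrative labels, not a standard taxonomy.
MEASURED_RATES = {
    "structured_transcription": 0.0147,     # ambient scribes
    "open_ended_summarization": 0.641,      # no mitigation
    "summarization_mitigated": 0.23,        # GPT-4o with structured mitigation
}

def single_threshold_verdicts(rates: dict, threshold: float) -> dict:
    """Naive regulatory model: one pass/fail threshold for every task."""
    return {task: rate < threshold for task, rate in rates.items()}

# A lax single threshold (70%) permits the dangerous summarization workflow;
# a strict one (1%) prohibits even the safest transcription workflow.
lax = single_threshold_verdicts(MEASURED_RATES, 0.70)
strict = single_threshold_verdicts(MEASURED_RATES, 0.01)
assert lax["open_ended_summarization"] is True       # dangerous task passes
assert strict["structured_transcription"] is False   # safe task fails

# Hypothetical task-specific ceilings, one per benchmarked task category.
TASK_CEILINGS = {
    "structured_transcription": 0.02,
    "open_ended_summarization": 0.05,
    "summarization_mitigated": 0.05,
}

def task_specific_verdicts(rates: dict, ceilings: dict) -> dict:
    """Task-specific benchmarking: each task judged against its own ceiling."""
    return {task: rate < ceilings[task] for task, rate in rates.items()}

print(task_specific_verdicts(MEASURED_RATES, TASK_CEILINGS))
```

Under the task-specific ceilings, transcription passes while both summarization variants fail, which is the outcome no single threshold can produce for all three tasks at once.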