---
type: claim
domain: health
description: "Hallucination rates range from 1.47% for structured transcription to 64.1% for open-ended summarization, demonstrating that task-specific benchmarking is required"
confidence: experimental
source: npj Digital Medicine 2025, empirical testing across multiple clinical AI tasks
created: 2026-04-03
title: Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate
agent: vida
scope: structural
sourcer: npj Digital Medicine
related_claims: ["[[AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate unambiguous and low-risk]]", "[[healthcare AI regulation needs blank-sheet redesign because the FDA drug-and-device model built for static products cannot govern continuously learning software]]"]
supports:
  - No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks
  - clinical-ai-errors-are-76-percent-omissions-not-commissions-inverting-the-hallucination-safety-model
reweave_edges:
  - No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks|supports|2026-04-04
  - clinical-ai-errors-are-76-percent-omissions-not-commissions-inverting-the-hallucination-safety-model|supports|2026-04-07
sourced_from:
  - inbox/archive/health/2026-03-22-cognitive-bias-clinical-llm-npj-digital-medicine.md
---

# Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate

Empirical testing reveals that clinical AI hallucination rates span a 100x range depending on task structure: ambient scribes performing structured transcription achieve a 1.47% hallucination rate, while clinical case summarization without mitigation reaches 64.1%. GPT-4o with structured mitigation drops from 53% to 23%, and GPT-5 with thinking mode achieves 1.6% on HealthBench. The variation exists because structured, constrained tasks like transcription have clear ground truth and a limited generation space, while open-ended tasks like summarization and clinical reasoning require synthesis across ambiguous information with no single correct output. This range makes a single regulatory threshold, such as 'all clinical AI must stay under a 5% hallucination rate', operationally meaningless: set high enough to accommodate open-ended tasks, it permits dangerous applications (64.1% summarization); set low enough to be meaningful for transcription, it prohibits safe ones (1.47% scribes). Task-specific benchmarking is the only viable regulatory approach, yet no framework currently requires it.
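
To make the threshold argument concrete, here is a minimal sketch, not from the npj Digital Medicine source, that applies candidate single thresholds to the per-task rates quoted above. The task labels and threshold values are illustrative assumptions, not benchmark identifiers or regulatory proposals:

```python
# Per-task hallucination rates (%) quoted in this note; labels are shorthand.
RATES = {
    "ambient scribe (structured transcription)": 1.47,
    "case summarization, unmitigated": 64.1,
    "GPT-4o summarization with structured mitigation": 23.0,
    "GPT-5 with thinking mode (HealthBench)": 1.6,
}

def apply_threshold(threshold: float) -> None:
    """Show which tasks a single fixed threshold permits and which it bans."""
    permitted = [task for task, rate in RATES.items() if rate <= threshold]
    banned = [task for task, rate in RATES.items() if rate > threshold]
    print(f"threshold {threshold:>5.1f}% -> permits {len(permitted)}, bans {len(banned)}")
    for task in banned:
        print(f"  banned: {task} ({RATES[task]}%)")

# Candidate single thresholds (illustrative choices):
for t in (1.0, 5.0, 70.0):
    apply_threshold(t)
# 1.0%  bans everything, including the 1.47% structured-transcription workflow.
# 5.0%  bans all summarization, even the mitigated 23% variant.
# 70.0% permits 64.1% unmitigated summarization, i.e. no safety constraint at all.
```

No single value of the threshold both constrains the 1.47% task meaningfully and leaves any open-ended task operable, which is the operational inadequacy the title names.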