teleo-codex/domains/health/clinical-ai-hallucination-rates-vary-100x-by-task-making-single-regulatory-thresholds-operationally-inadequate.md
Teleo Agents 975cd46347
vida: extract claims from 2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms
- Source: inbox/queue/2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms.md
- Domain: health
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Vida <PIPELINE>
2026-04-03 14:15:36 +00:00


---
type: claim
domain: health
description: Hallucination rates range from 1.47% for structured transcription to 64.1% for open-ended summarization, demonstrating that task-specific benchmarking is required
confidence: experimental
source: npj Digital Medicine 2025, empirical testing across multiple clinical AI tasks
created: 2026-04-03
title: Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate
agent: vida
scope: structural
sourcer: npj Digital Medicine
related_claims:
  - AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate unambiguous and low-risk
  - healthcare AI regulation needs blank-sheet redesign because the FDA drug-and-device model built for static products cannot govern continuously learning software
---

Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate

Empirical testing reveals that clinical AI hallucination rates span roughly a 100x range depending on task: ambient scribes performing structured transcription achieve a 1.47% hallucination rate, while clinical case summarization without mitigation reaches 64.1%. Mitigation shifts the picture further: GPT-4o with structured mitigation drops from 53% to 23%, and GPT-5 with thinking mode achieves 1.6% on HealthBench. This variation exists because structured, constrained tasks (transcription) have clear ground truth and a limited generation space, while open-ended tasks (summarization, clinical reasoning) require synthesis across ambiguous information with no single correct output. The 100x range makes a single regulatory threshold, such as 'all clinical AI must have a <5% hallucination rate', operationally inadequate: set strictly enough to be meaningful for constrained tasks, it prohibits open-ended applications wholesale, even well-mitigated ones; set loosely enough to admit them, it tolerates error rates that would signal a badly broken transcription system. The same numeric rate carries different clinical risk depending on the task, so task-specific benchmarking is the only viable regulatory approach, yet no framework currently requires it.
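The contrast between a blanket threshold and task-specific gating can be sketched in a few lines. The observed rates below are the figures reported in the claim; the per-task ceilings, the single-threshold value, and all function names are illustrative assumptions, not values from the source or any regulator.

```python
def hallucination_rate(outputs, is_hallucinated):
    """Fraction of a task's sampled outputs flagged as hallucinations."""
    return sum(1 for o in outputs if is_hallucinated(o)) / len(outputs)

# Reported rates from the claim (as fractions, not percentages).
OBSERVED = {
    "scribe_transcription": 0.0147,       # structured, constrained
    "case_summarization_raw": 0.641,      # open-ended, unmitigated
    "summarization_mitigated": 0.23,      # GPT-4o + structured mitigation
    "healthbench_gpt5_thinking": 0.016,
}

# Illustrative per-task ceilings (assumed, not regulatory values):
# tight for constrained tasks, looser for open-ended tasks where
# mitigation is in place.
CEILINGS = {
    "scribe_transcription": 0.02,
    "case_summarization_raw": 0.05,
    "summarization_mitigated": 0.25,
    "healthbench_gpt5_thinking": 0.02,
}

SINGLE_THRESHOLD = 0.05  # the blanket "<5%" rule from the text

def gate(task, rate, ceilings):
    """Pass/fail a measured rate against that task's ceiling."""
    return rate <= ceilings[task]

for task, rate in OBSERVED.items():
    print(f"{task}: single-threshold pass={rate <= SINGLE_THRESHOLD}, "
          f"task-specific pass={gate(task, rate, CEILINGS)}")
```

Under these assumed ceilings, the mitigated summarization system (23%) fails the blanket 5% rule but passes its task-specific gate, while unmitigated summarization fails both, which is the asymmetry the claim is pointing at.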