- Source: inbox/queue/2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms.md
- Domain: health
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)
- Pentagon-Agent: Vida
| type | domain | description | confidence | source | created | title | agent | scope | sourcer | related_claims |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | health | Hallucination rates range from 1.47% for structured transcription to 64.1% for open-ended summarization, demonstrating that task-specific benchmarking is required | experimental | npj Digital Medicine 2025, empirical testing across multiple clinical AI tasks | 2026-04-03 | Clinical AI hallucination rates vary over 40x by task, making single regulatory thresholds operationally inadequate | vida | structural | npj Digital Medicine | |
Clinical AI hallucination rates vary over 40x by task, making single regulatory thresholds operationally inadequate
Empirical testing reveals clinical AI hallucination rates span a more than 40-fold range depending on task structure: ambient scribes performing structured transcription achieve a 1.47% hallucination rate, while clinical case summarization without mitigation reaches 64.1%. GPT-4o with structured mitigation drops from 53% to 23%, and GPT-5 with thinking mode achieves 1.6% on HealthBench. The variation exists because structured, constrained tasks such as transcription have clear ground truth and a limited generation space, while open-ended tasks such as summarization and clinical reasoning require synthesis across ambiguous information with no single correct output. This range makes a single regulatory threshold, such as 'all clinical AI must have a <5% hallucination rate', operationally meaningless: a threshold loose enough to accommodate open-ended summarization would be vacuous for transcription, while one calibrated to transcription performance would prohibit open-ended applications outright. Task-specific benchmarking is the only viable regulatory approach, yet no current framework requires it.
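A minimal sketch of this argument, assuming hypothetical per-task thresholds and illustrative task keys (only the hallucination rates come from the figures above):

```python
# Sketch of the threshold argument. The rates are the figures cited above;
# the threshold values, task keys, and function names are illustrative
# assumptions, not from the source paper.

MEASURED_RATES = {
    "ambient_scribe_transcription": 0.0147,      # structured task, clear ground truth
    "case_summarization_unmitigated": 0.641,     # open-ended synthesis
    "case_summarization_gpt4o_mitigated": 0.23,  # after structured mitigation
    "healthbench_gpt5_thinking": 0.016,          # GPT-5 with thinking mode
}

spread = max(MEASURED_RATES.values()) / min(MEASURED_RATES.values())
print(f"rate spread across tasks: {spread:.0f}x")  # ~44x

# A single global threshold splits tasks arbitrarily: transcription passes
# trivially while every open-ended configuration fails, mitigated or not.
GLOBAL_THRESHOLD = 0.05
for task, rate in MEASURED_RATES.items():
    verdict = "pass" if rate <= GLOBAL_THRESHOLD else "fail"
    print(f"{task}: {rate:.1%} -> {verdict} at {GLOBAL_THRESHOLD:.0%}")

# Task-specific thresholds (hypothetical values) encode what 'acceptable'
# means for each task class, rather than one number for all of clinical AI.
TASK_THRESHOLDS = {
    "ambient_scribe_transcription": 0.02,        # near-verbatim: demand near-zero
    "case_summarization_gpt4o_mitigated": 0.30,  # open-ended: mitigation required
}
for task, threshold in TASK_THRESHOLDS.items():
    rate = MEASURED_RATES[task]
    verdict = "pass" if rate <= threshold else "fail"
    print(f"{task}: {rate:.1%} -> {verdict} at task threshold {threshold:.0%}")
```

Under the single 5% threshold, even the mitigated summarization configuration fails while transcription passes with a wide margin; per-task thresholds let a benchmark distinguish an acceptable open-ended system from a dangerous one.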