vida: extract claims from 2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms

- Source: inbox/queue/2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms.md - Domain: health - Claims: 2, Entities: 0 - Enrichments: 2 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Vida <PIPELINE>
2026-04-03 14:14:07 +00:00 · 2026-04-03 14:14:07 +00:00 · 975cd46347
commit 975cd46347
parent 5f0ccfad55
2 changed files with 34 additions and 0 deletions
--- a/domains/health/clinical-ai-hallucination-rates-vary-100x-by-task-making-single-regulatory-thresholds-operationally-inadequate.md
+++ b/domains/health/clinical-ai-hallucination-rates-vary-100x-by-task-making-single-regulatory-thresholds-operationally-inadequate.md
@ -0,0 +1,17 @@
+---
+type: claim
+domain: health
+description: "Hallucination rates range from 1.47% for structured transcription to 64.1% for open-ended summarization demonstrating that task-specific benchmarking is required"
+confidence: experimental
+source: npj Digital Medicine 2025, empirical testing across multiple clinical AI tasks
+created: 2026-04-03
+title: Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate
+agent: vida
+scope: structural
+sourcer: npj Digital Medicine
+related_claims: ["[[AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate unambiguous and low-risk]]", "[[healthcare AI regulation needs blank-sheet redesign because the FDA drug-and-device model built for static products cannot govern continuously learning software]]"]
+---
+
+# Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate
+
+Empirical testing reveals clinical AI hallucination rates span a 100x range depending on task complexity: ambient scribes (structured transcription) achieve 1.47% hallucination rates, while clinical case summarization without mitigation reaches 64.1%. GPT-4o with structured mitigation drops from 53% to 23%, and GPT-5 with thinking mode achieves 1.6% on HealthBench. This variation exists because structured, constrained tasks (transcription) have clear ground truth and limited generation space, while open-ended tasks (summarization, clinical reasoning) require synthesis across ambiguous information with no single correct output. The 100x range demonstrates that a single regulatory threshold—such as 'all clinical AI must have <5% hallucination rate'—is operationally meaningless because it would either permit dangerous applications (64.1% summarization) or prohibit safe ones (1.47% transcription) depending on where the threshold is set. Task-specific benchmarking is the only viable regulatory approach, yet no framework currently requires it.
--- a/domains/health/no-regulatory-body-globally-has-established-mandatory-hallucination-rate-benchmarks-for-clinical-ai-despite-evidence-base.md
+++ b/domains/health/no-regulatory-body-globally-has-established-mandatory-hallucination-rate-benchmarks-for-clinical-ai-despite-evidence-base.md
@ -0,0 +1,17 @@
+---
+type: claim
+domain: health
+description: FDA, EU MDR/AI Act, MHRA, and ISO 22863 standards all lack hallucination rate requirements as of 2025 creating a regulatory gap for the fastest-adopted clinical AI category
+confidence: likely
+source: npj Digital Medicine 2025 regulatory review, confirmed across FDA, EU, MHRA, ISO standards
+created: 2026-04-03
+title: No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks
+agent: vida
+scope: structural
+sourcer: npj Digital Medicine
+related_claims: ["[[AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate unambiguous and low-risk]]", "[[healthcare AI regulation needs blank-sheet redesign because the FDA drug-and-device model built for static products cannot govern continuously learning software]]"]
+---
+
+# No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks
+
+Despite clinical AI hallucination rates ranging from 1.47% to 64.1% across tasks, and despite the existence of proposed assessment frameworks (including this paper's framework), no regulatory body globally has established mandatory hallucination rate thresholds as of 2025. FDA enforcement discretion, EU MDR/AI Act, MHRA guidance, and ISO 22863 AI safety standards (in development) all lack specific hallucination rate benchmarks. The paper notes three reasons for this regulatory gap: (1) generative AI models are non-deterministic—same prompt yields different responses, (2) hallucination rates are model-version, task-domain, and prompt-dependent making single benchmarks insufficient, and (3) no consensus exists on acceptable clinical hallucination thresholds. This regulatory absence is most consequential for ambient scribes—the fastest-adopted clinical AI at 92% provider adoption—which operate with zero standardized safety metrics despite documented 1.47% hallucination rates. The gap represents either regulatory capture (industry resistance to standards) or regulatory paralysis (inability to govern non-deterministic systems with existing frameworks).