vida: extract claims from 2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run

- Source: inbox/queue/2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms.md
- Domain: health
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Vida <PIPELINE>
This commit is contained in:
Teleo Agents 2026-04-03 14:14:07 +00:00
parent 5f0ccfad55
commit 975cd46347
2 changed files with 34 additions and 0 deletions

View file

@ -0,0 +1,17 @@
---
type: claim
domain: health
description: "Hallucination rates range from 1.47% for structured transcription to 64.1% for open-ended summarization demonstrating that task-specific benchmarking is required"
confidence: experimental
source: npj Digital Medicine 2025, empirical testing across multiple clinical AI tasks
created: 2026-04-03
title: Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate
agent: vida
scope: structural
sourcer: npj Digital Medicine
related_claims: ["[[AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate unambiguous and low-risk]]", "[[healthcare AI regulation needs blank-sheet redesign because the FDA drug-and-device model built for static products cannot govern continuously learning software]]"]
---
# Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate
Empirical testing reveals clinical AI hallucination rates span a 100x range depending on task complexity: ambient scribes (structured transcription) achieve 1.47% hallucination rates, while clinical case summarization without mitigation reaches 64.1%. GPT-4o with structured mitigation drops from 53% to 23%, and GPT-5 with thinking mode achieves 1.6% on HealthBench. This variation exists because structured, constrained tasks (transcription) have clear ground truth and limited generation space, while open-ended tasks (summarization, clinical reasoning) require synthesis across ambiguous information with no single correct output. The 100x range demonstrates that a single regulatory threshold—such as 'all clinical AI must have <5% hallucination rate'is operationally meaningless because it would either permit dangerous applications (64.1% summarization) or prohibit safe ones (1.47% transcription) depending on where the threshold is set. Task-specific benchmarking is the only viable regulatory approach, yet no framework currently requires it.

View file

@ -0,0 +1,17 @@
---
type: claim
domain: health
description: FDA, EU MDR/AI Act, MHRA, and ISO 22863 standards all lack hallucination rate requirements as of 2025 creating a regulatory gap for the fastest-adopted clinical AI category
confidence: likely
source: npj Digital Medicine 2025 regulatory review, confirmed across FDA, EU, MHRA, ISO standards
created: 2026-04-03
title: No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks
agent: vida
scope: structural
sourcer: npj Digital Medicine
related_claims: ["[[AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate unambiguous and low-risk]]", "[[healthcare AI regulation needs blank-sheet redesign because the FDA drug-and-device model built for static products cannot govern continuously learning software]]"]
---
# No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks
Despite clinical AI hallucination rates ranging from 1.47% to 64.1% across tasks, and despite the existence of proposed assessment frameworks (including this paper's framework), no regulatory body globally has established mandatory hallucination rate thresholds as of 2025. FDA enforcement discretion, EU MDR/AI Act, MHRA guidance, and ISO 22863 AI safety standards (in development) all lack specific hallucination rate benchmarks. The paper notes three reasons for this regulatory gap: (1) generative AI models are non-deterministic—same prompt yields different responses, (2) hallucination rates are model-version, task-domain, and prompt-dependent making single benchmarks insufficient, and (3) no consensus exists on acceptable clinical hallucination thresholds. This regulatory absence is most consequential for ambient scribes—the fastest-adopted clinical AI at 92% provider adoption—which operate with zero standardized safety metrics despite documented 1.47% hallucination rates. The gap represents either regulatory capture (industry resistance to standards) or regulatory paralysis (inability to govern non-deterministic systems with existing frameworks).