--- type: source title: "A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation" author: "npj Digital Medicine" url: https://www.nature.com/articles/s41746-025-01670-7 date: 2025-06-01 domain: health secondary_domains: [ai-alignment] format: research-paper status: unprocessed priority: medium tags: [clinical-AI, hallucination, LLM, safety-framework, medical-text, regulatory-benchmark, belief-5, generative-AI] --- ## Content npj Digital Medicine paper proposing a framework to assess clinical safety and hallucination rates in LLMs for medical text summarization. Published 2025. **Key empirical findings on hallucination rates:** - Hallucination rates on clinical case summaries WITHOUT mitigation: **64.1%** - Hallucination rates WITH mitigation prompts: **43.1%** (33% improvement with structured prompting) - Best performance: GPT-4o dropped from 53% to 23% with structured mitigation - Comparison: GPT-5 with thinking mode achieved **1.6%** hallucination on HealthBench (a different benchmark) - Context: The 1.47% ambient scribe hallucination rate (Session 18 source) is from structured, constrained transcription — NOT from open-ended medical text summarization which can hit 64.1% **Regulatory benchmarking finding (null result):** No country has established mandatory hallucination rate thresholds as a regulatory requirement for clinical AI. ISO 22863 standards (AI safety standards) are in development and will influence future device design, but do NOT include hallucination rate benchmarks. EU MDR/AI Act, FDA, MHRA: none specify acceptable hallucination rates. **The framework proposal:** The paper proposes a standardized assessment framework including: 1. Clinical accuracy metrics (hallucination rate, omission rate) 2. Safety-specific evaluation (false negative harms vs. false positive harms) 3. Task-specific benchmarking (summarization ≠ diagnosis ≠ triage) 4. Mitigation strategy assessment **Why no country has mandated benchmarks:** - Generative AI models are non-deterministic — same prompt can yield different responses - Hallucination rates are model-version, task-domain, and prompt-dependent — a single benchmark number is insufficient - No consensus on acceptable clinical hallucination threshold exists in the literature - The regulatory bodies that are loosening oversight (FDA, EU Commission) are not creating hallucination standards — they are moving in the opposite direction **Range of real-world hallucination rates across tasks:** - Ambient scribe (structured transcription): 1.47% - Medical text summarization with mitigation: 43.1% - Clinical case summaries without mitigation: 64.1% - HealthBench (standardized benchmark, GPT-5): 1.6% The 100x range across tasks demonstrates why a single regulatory threshold is operationally inadequate. ## Agent Notes **Why this matters:** This paper directly answers the Session 18 Branching Point B question: "Is any country proposing hallucination rate benchmarking as a regulatory metric?" The answer is no. The paper proposes a framework but notes no regulatory body has adopted it. This confirms the regulatory surveillance gap identified in Session 18 — the fastest-adopted clinical AI category (scribes at 92% adoption) operates with no hallucination rate requirement, while research shows rates ranging from 1.47% to 64.1% depending on task. **What surprised me:** The 100x range in hallucination rates across tasks (1.47% for scribes to 64.1% for case summaries without mitigation). The "ambient scribe" statistic that was cited in media coverage as concerning (1.47%) is actually at the LOW end of the range — not the high end. Generative AI in more complex clinical tasks produces far higher hallucination rates. **What I expected but didn't find:** Any regulatory body proposing hallucination benchmarks. The null result (no country has done this) is the key finding — confirms that the fastest-growing clinical AI category has zero standardized safety metrics required by any regulator. **KB connections:** Session 18 ambient scribe hallucination (1.47%); generative AI architectural incompatibility (Session 18 claim candidate); ECRI #1 hazard; FDA enforcement discretion expansion. **Extraction hints:** - "No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI as of 2026, despite hallucination rates ranging from 1.47% (ambient scribes, structured transcription) to 64.1% (clinical case summarization without mitigation) — the regulatory gap is most consequential for open-ended generative AI tasks where rates are highest" - "The 100x variation in clinical AI hallucination rates across tasks (structured transcription to open-ended summarization) demonstrates that a single regulatory threshold is operationally inadequate — each clinical AI application requires task-specific safety benchmarking that no regulatory framework currently requires" **Context:** npj Digital Medicine is Nature's digital health journal — high-impact, peer-reviewed. This paper proposes the framework that regulatory bodies should be requiring but aren't. Published 2025, in the same period as FDA enforcement discretion expansion. ## Curator Notes PRIMARY CONNECTION: Session 18 ambient scribe hallucination; generative AI architectural incompatibility claim candidates; FDA deregulation WHY ARCHIVED: Confirms null result for Session 18 Branching Point B (no country has hallucination benchmarks) AND provides the 100x variation finding that strengthens the regulatory gap claim. The task-specificity of hallucination rates is important for claim scoping. EXTRACTION HINT: The "null result is the finding" for regulatory benchmarking. Extractor should note that the absence of hallucination rate standards — despite a clear evidence base and a proposed framework — is itself evidence of regulatory capture or regulatory paralysis.