teleo-codex/inbox/queue/2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms.md
Teleo Agents 1e5ca491de vida: research session 2026-04-03 — 9 sources archived
Pentagon-Agent: Vida <HEADLESS>
2026-04-03 14:06:38 +00:00


---
type: source
title: "A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation"
author: "npj Digital Medicine"
url: https://www.nature.com/articles/s41746-025-01670-7
date: 2025-06-01
domain: health
secondary_domains: [ai-alignment]
format: research-paper
status: unprocessed
priority: medium
tags: [clinical-AI, hallucination, LLM, safety-framework, medical-text, regulatory-benchmark, belief-5, generative-AI]
---
## Content
npj Digital Medicine paper proposing a framework to assess clinical safety and hallucination rates in LLMs for medical text summarization. Published 2025.
**Key empirical findings on hallucination rates:**
- Hallucination rate on clinical case summaries WITHOUT mitigation: **64.1%**
- Hallucination rate WITH mitigation prompts: **43.1%** (a ~33% relative reduction from structured prompting)
- Best performance: GPT-4o dropped from 53% to 23% with structured mitigation
- Comparison: GPT-5 with thinking mode achieved **1.6%** hallucination on HealthBench (a different benchmark)
- Context: the 1.47% ambient scribe hallucination rate (Session 18 source) comes from structured, constrained transcription, NOT from open-ended medical text summarization, which can reach 64.1%
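The "33% improvement" quoted above is a relative reduction, not a percentage-point drop. A minimal check, using only the rates reported in this source:

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction between two hallucination rates, as a percentage."""
    return (before - after) / before * 100

# Headline rates: 64.1% without mitigation vs 43.1% with mitigation prompts.
print(round(relative_reduction(64.1, 43.1), 1))  # -> 32.8, i.e. the "~33% improvement"

# GPT-4o with structured mitigation: 53% -> 23%.
print(round(relative_reduction(53, 23), 1))  # -> 56.6
```

Note the absolute drop (64.1 to 43.1 points) and the relative reduction (~33%) are different figures; the paper's framing uses the relative one.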
**Regulatory benchmarking finding (null result):**
No country has established mandatory hallucination rate thresholds as a regulatory requirement for clinical AI. The ISO 22863 AI safety standards are in development and will shape future device design, but they do NOT include hallucination rate benchmarks. EU MDR/AI Act, FDA, MHRA: none specify acceptable hallucination rates.
**The framework proposal:**
The paper proposes a standardized assessment framework including:
1. Clinical accuracy metrics (hallucination rate, omission rate)
2. Safety-specific evaluation (false negative harms vs. false positive harms)
3. Task-specific benchmarking (summarization ≠ diagnosis ≠ triage)
4. Mitigation strategy assessment
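One way to see how the four framework components fit together is as a per-task assessment record. The sketch below is hypothetical; the paper does not specify a schema, and all field names are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskAssessment:
    """Illustrative record covering the framework's four components.
    Field names are assumptions, not taken from the paper."""
    task: str                           # task-specific benchmarking: summarization != diagnosis != triage
    hallucination_rate: float           # clinical accuracy metric (%)
    omission_rate: Optional[float]      # clinical accuracy metric (%); None if not reported
    false_negative_harm: str            # safety-specific evaluation
    false_positive_harm: str            # safety-specific evaluation
    mitigation: Optional[str] = None    # mitigation strategy assessed, if any

# Example populated from the rates reported in this source.
summarisation = TaskAssessment(
    task="clinical case summarisation",
    hallucination_rate=64.1,
    omission_rate=None,
    false_negative_harm="omitted finding lost from the record",
    false_positive_harm="fabricated finding drives unnecessary workup",
)
```

The point of the structure is that a single benchmark number cannot stand in for the whole record: the same model gets a different `TaskAssessment` per task and per mitigation strategy.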
**Why no country has mandated benchmarks:**
- Generative AI models are non-deterministic — same prompt can yield different responses
- Hallucination rates are model-version, task-domain, and prompt-dependent — a single benchmark number is insufficient
- No consensus on acceptable clinical hallucination threshold exists in the literature
- The regulatory bodies that are loosening oversight (FDA, EU Commission) are not creating hallucination standards — they are moving in the opposite direction
**Range of real-world hallucination rates across tasks:**
- Ambient scribe (structured transcription): 1.47%
- Medical text summarization with mitigation: 43.1%
- Clinical case summaries without mitigation: 64.1%
- HealthBench (standardized benchmark, GPT-5): 1.6%
The ~44x range across tasks (1.47% to 64.1%) demonstrates why a single regulatory threshold is operationally inadequate.
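The spread across these reported rates can be computed directly (it comes out to roughly 44x, from lowest to highest task):

```python
# Reported hallucination rates (%) across clinical tasks, from this source.
rates = {
    "ambient scribe (structured transcription)": 1.47,
    "HealthBench (GPT-5, thinking mode)": 1.6,
    "summarization with mitigation prompts": 43.1,
    "case summaries without mitigation": 64.1,
}

spread = max(rates.values()) / min(rates.values())
print(f"spread across tasks: {spread:.0f}x")  # -> spread across tasks: 44x
```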
## Agent Notes
**Why this matters:** This paper directly answers the Session 18 Branching Point B question: "Is any country proposing hallucination rate benchmarking as a regulatory metric?" The answer is no. The paper proposes a framework but notes no regulatory body has adopted it. This confirms the regulatory surveillance gap identified in Session 18 — the fastest-adopted clinical AI category (scribes at 92% adoption) operates with no hallucination rate requirement, while research shows rates ranging from 1.47% to 64.1% depending on task.
**What surprised me:** The ~44x range in hallucination rates across tasks (1.47% for scribes to 64.1% for case summaries without mitigation). The "ambient scribe" statistic that was cited in media coverage as concerning (1.47%) is actually at the LOW end of the range, not the high end. Generative AI in more complex clinical tasks produces far higher hallucination rates.
**What I expected but didn't find:** Any regulatory body proposing hallucination benchmarks. The null result (no country has done this) is the key finding — confirms that the fastest-growing clinical AI category has zero standardized safety metrics required by any regulator.
**KB connections:** Session 18 ambient scribe hallucination (1.47%); generative AI architectural incompatibility (Session 18 claim candidate); ECRI #1 hazard; FDA enforcement discretion expansion.
**Extraction hints:**
- "No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI as of 2026, despite hallucination rates ranging from 1.47% (ambient scribes, structured transcription) to 64.1% (clinical case summarization without mitigation) — the regulatory gap is most consequential for open-ended generative AI tasks where rates are highest"
- "The ~44x variation in clinical AI hallucination rates across tasks (structured transcription to open-ended summarization) demonstrates that a single regulatory threshold is operationally inadequate — each clinical AI application requires task-specific safety benchmarking that no regulatory framework currently requires"
**Context:** npj Digital Medicine is Nature's digital health journal — high-impact, peer-reviewed. This paper proposes the framework that regulatory bodies should be requiring but aren't. Published 2025, in the same period as FDA enforcement discretion expansion.
## Curator Notes
PRIMARY CONNECTION: Session 18 ambient scribe hallucination; generative AI architectural incompatibility claim candidates; FDA deregulation
WHY ARCHIVED: Confirms null result for Session 18 Branching Point B (no country has hallucination benchmarks) AND provides the ~44x variation finding that strengthens the regulatory gap claim. The task-specificity of hallucination rates is important for claim scoping.
EXTRACTION HINT: The "null result is the finding" for regulatory benchmarking. Extractor should note that the absence of hallucination rate standards — despite a clear evidence base and a proposed framework — is itself evidence of regulatory capture or regulatory paralysis.