teleo-codex/inbox/queue/2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms.md
Teleo Agents 1e5ca491de vida: research session 2026-04-03 — 9 sources archived
Pentagon-Agent: Vida <HEADLESS>
2026-04-03 14:06:38 +00:00

5.8 KiB

type: source
title: A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation
author: npj Digital Medicine
url: https://www.nature.com/articles/s41746-025-01670-7
date: 2025-06-01
domain: health
secondary_domains: ai-alignment
format: research-paper
status: unprocessed
priority: medium
tags: clinical-AI, hallucination, LLM, safety-framework, medical-text, regulatory-benchmark, belief-5, generative-AI

Content

npj Digital Medicine paper proposing a framework to assess clinical safety and hallucination rates in LLMs for medical text summarization. Published 2025.

Key empirical findings on hallucination rates:

  • Hallucination rates on clinical case summaries WITHOUT mitigation: 64.1%
  • Hallucination rates WITH mitigation prompts: 43.1% (a ~33% relative reduction from structured prompting)
  • Best performance: GPT-4o dropped from 53% to 23% with structured mitigation
  • Comparison: GPT-5 with thinking mode achieved 1.6% hallucination on HealthBench (a different benchmark)
  • Context: The 1.47% ambient scribe hallucination rate (Session 18 source) is from structured, constrained transcription — NOT from open-ended medical text summarization which can hit 64.1%
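The "33% improvement" figure above is a relative reduction, not an absolute one; the absolute drop is about 21 percentage points. A quick arithmetic check, using the rates as reported above:

```python
# Hallucination rates reported in the paper (percent)
baseline = 64.1   # clinical case summaries, no mitigation
mitigated = 43.1  # with structured mitigation prompts

absolute_drop = baseline - mitigated            # in percentage points
relative_drop = absolute_drop / baseline * 100  # percent reduction vs. baseline

print(f"absolute drop: {absolute_drop:.1f} points")  # 21.0 points
print(f"relative drop: {relative_drop:.1f}%")        # 32.8%, i.e. the ~33% figure
```

Reporting the relative reduction alongside the absolute rates matters here, because a "33% improvement" still leaves a 43.1% hallucination rate.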

Regulatory benchmarking finding (null result): No country has established mandatory hallucination rate thresholds as a regulatory requirement for clinical AI. ISO 22863 standards (AI safety standards) are in development and will influence future device design, but do NOT include hallucination rate benchmarks. EU MDR/AI Act, FDA, MHRA: none specify acceptable hallucination rates.

The framework proposal: the paper proposes a standardized assessment framework comprising:

  1. Clinical accuracy metrics (hallucination rate, omission rate)
  2. Safety-specific evaluation (false negative harms vs. false positive harms)
  3. Task-specific benchmarking (summarization ≠ diagnosis ≠ triage)
  4. Mitigation strategy assessment
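To make the four dimensions concrete, here is a minimal sketch of what a per-task assessment record might look like. The class and field names are my own illustration, not the paper's schema; the example values come from the rates cited in this note, except the omission rate, which is a placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class ClinicalLLMAssessment:
    """Hypothetical record covering the framework's four dimensions."""
    task: str                   # benchmarks are task-specific (summarization != diagnosis != triage)
    hallucination_rate: float   # fraction of outputs containing fabricated content
    omission_rate: float        # fraction of outputs missing clinically relevant facts
    false_negative_harm: str    # safety evaluation: harm from missed findings
    false_positive_harm: str    # safety evaluation: harm from spurious findings
    mitigations: list[str] = field(default_factory=list)  # e.g. structured prompting

summary_eval = ClinicalLLMAssessment(
    task="medical text summarization",
    hallucination_rate=0.431,   # with mitigation prompts, per the paper
    omission_rate=0.0,          # placeholder: not reported in this note
    false_negative_harm="missed diagnosis dropped from the summary",
    false_positive_harm="fabricated finding prompting unnecessary workup",
    mitigations=["structured mitigation prompts"],
)
```

Separating omission from hallucination, and false-negative from false-positive harm, is what distinguishes this from a single-number benchmark.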

Why no country has mandated benchmarks:

  • Generative AI models are non-deterministic — same prompt can yield different responses
  • Hallucination rates are model-version, task-domain, and prompt-dependent — a single benchmark number is insufficient
  • No consensus on acceptable clinical hallucination threshold exists in the literature
  • The regulatory bodies that are loosening oversight (FDA, EU Commission) are not creating hallucination standards — they are moving in the opposite direction

Range of real-world hallucination rates across tasks:

  • Ambient scribe (structured transcription): 1.47%
  • Medical text summarization with mitigation: 43.1%
  • Clinical case summaries without mitigation: 64.1%
  • HealthBench (standardized benchmark, GPT-5 with thinking mode): 1.6%

The ~44x range across tasks demonstrates why a single regulatory threshold is operationally inadequate.
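The spread across those tasks can be checked directly (rates as listed above; the labels are shorthand, not the paper's):

```python
# Reported hallucination rates (percent) by task, from the list above
rates = {
    "ambient scribe (structured transcription)": 1.47,
    "HealthBench, GPT-5 with thinking mode": 1.6,
    "summarization with mitigation prompts": 43.1,
    "case summaries without mitigation": 64.1,
}

lo, hi = min(rates.values()), max(rates.values())
print(f"spread: {hi / lo:.1f}x ({lo}% to {hi}%)")  # prints: spread: 43.6x (1.47% to 64.1%)
```

Note the ratio works out to roughly 44x, not 100x; the qualitative point (a single threshold cannot cover this range) is unchanged.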

Agent Notes

Why this matters: This paper directly answers the Session 18 Branching Point B question: "Is any country proposing hallucination rate benchmarking as a regulatory metric?" The answer is no. The paper proposes a framework but notes that no regulatory body has adopted it. This confirms the regulatory surveillance gap identified in Session 18: the fastest-adopted clinical AI category (scribes at 92% adoption) operates with no hallucination rate requirement, while research shows rates ranging from 1.47% to 64.1% depending on task.

What surprised me: The ~44x range in hallucination rates across tasks (1.47% for scribes to 64.1% for case summaries without mitigation). The "ambient scribe" statistic cited in media coverage as concerning (1.47%) is actually at the LOW end of the range, not the high end. Generative AI on more complex clinical tasks produces far higher hallucination rates.

What I expected but didn't find: Any regulatory body proposing hallucination benchmarks. The null result (no country has done this) is the key finding: it confirms that the fastest-growing clinical AI category has zero standardized safety metrics required by any regulator.

KB connections: Session 18 ambient scribe hallucination (1.47%); generative AI architectural incompatibility (Session 18 claim candidate); ECRI #1 hazard; FDA enforcement discretion expansion.

Extraction hints:

  • "No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI as of 2026, despite hallucination rates ranging from 1.47% (ambient scribes, structured transcription) to 64.1% (clinical case summarization without mitigation) — the regulatory gap is most consequential for open-ended generative AI tasks where rates are highest"
  • "The ~44x variation in clinical AI hallucination rates across tasks (structured transcription to open-ended summarization) demonstrates that a single regulatory threshold is operationally inadequate; each clinical AI application requires task-specific safety benchmarking that no regulatory framework currently requires"

Context: npj Digital Medicine is Nature's digital health journal: high-impact and peer-reviewed. This paper proposes the framework that regulatory bodies should be requiring but aren't. Published in 2025, in the same period as the FDA enforcement discretion expansion.

Curator Notes

PRIMARY CONNECTION: Session 18 ambient scribe hallucination; generative AI architectural incompatibility claim candidates; FDA deregulation.

WHY ARCHIVED: Confirms the null result for Session 18 Branching Point B (no country has hallucination benchmarks) AND provides the ~44x variation finding that strengthens the regulatory gap claim. The task-specificity of hallucination rates is important for claim scoping.

EXTRACTION HINT: The "null result is the finding" for regulatory benchmarking. The extractor should note that the absence of hallucination rate standards, despite a clear evidence base and a proposed framework, is itself evidence of regulatory capture or regulatory paralysis.