teleo-codex/inbox/queue/2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms.md
Teleo Agents 1e5ca491de vida: research session 2026-04-03 — 9 sources archived
Pentagon-Agent: Vida <HEADLESS>
2026-04-03 14:06:38 +00:00

5.8 KiB

type: source
title: A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation
author: npj Digital Medicine
url: https://www.nature.com/articles/s41746-025-01670-7
date: 2025-06-01
domain: health
secondary_domains: ai-alignment
format: research-paper
status: unprocessed
priority: medium
tags: clinical-AI, hallucination, LLM, safety-framework, medical-text, regulatory-benchmark, belief-5, generative-AI

Content

npj Digital Medicine paper proposing a framework to assess clinical safety and hallucination rates in LLMs for medical text summarization. Published 2025.

Key empirical findings on hallucination rates:

  • Hallucination rates on clinical case summaries WITHOUT mitigation: 64.1%
  • Hallucination rates WITH mitigation prompts: 43.1% (a ~33% relative reduction from structured prompting)
  • Best performance: GPT-4o dropped from 53% to 23% with structured mitigation
  • Comparison: GPT-5 with thinking mode achieved 1.6% hallucination on HealthBench (a different benchmark)
  • Context: The 1.47% ambient scribe hallucination rate (Session 18 source) is from structured, constrained transcription — NOT from open-ended medical text summarization which can hit 64.1%
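The "33% improvement" figure above is a relative reduction, not an absolute one; the absolute drop is about 21 percentage points. A quick arithmetic check, using the rates as reported above:

```python
# Hallucination rates reported in the paper (percent)
baseline = 64.1   # clinical case summaries, no mitigation
mitigated = 43.1  # with structured mitigation prompts

absolute_drop = baseline - mitigated            # in percentage points
relative_drop = absolute_drop / baseline * 100  # percent reduction vs. baseline

print(f"absolute drop: {absolute_drop:.1f} points")  # 21.0 points
print(f"relative drop: {relative_drop:.1f}%")        # 32.8%, i.e. the ~33% figure
```

Reporting the relative reduction alongside the absolute rates matters here, because a "33% improvement" still leaves a 43.1% hallucination rate.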

Regulatory benchmarking finding (null result): No country has established mandatory hallucination rate thresholds as a regulatory requirement for clinical AI. ISO 22863 standards (AI safety standards) are in development and will influence future device design, but do NOT include hallucination rate benchmarks. EU MDR/AI Act, FDA, MHRA: none specify acceptable hallucination rates.

The framework proposal: the paper proposes a standardized assessment framework comprising:

  1. Clinical accuracy metrics (hallucination rate, omission rate)
  2. Safety-specific evaluation (false negative harms vs. false positive harms)
  3. Task-specific benchmarking (summarization ≠ diagnosis ≠ triage)
  4. Mitigation strategy assessment
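To make the four dimensions concrete, here is a minimal sketch of what a per-task assessment record might look like. The class and field names are my own illustration, not the paper's schema; the example values come from the rates cited in this note, except the omission rate, which is a placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class ClinicalLLMAssessment:
    """Hypothetical record covering the framework's four dimensions."""
    task: str                   # benchmarks are task-specific (summarization != diagnosis != triage)
    hallucination_rate: float   # fraction of outputs containing fabricated content
    omission_rate: float        # fraction of outputs missing clinically relevant facts
    false_negative_harm: str    # safety evaluation: harm from missed findings
    false_positive_harm: str    # safety evaluation: harm from spurious findings
    mitigations: list[str] = field(default_factory=list)  # e.g. structured prompting

summary_eval = ClinicalLLMAssessment(
    task="medical text summarization",
    hallucination_rate=0.431,   # with mitigation prompts, per the paper
    omission_rate=0.0,          # placeholder: not reported in this note
    false_negative_harm="missed diagnosis dropped from the summary",
    false_positive_harm="fabricated finding prompting unnecessary workup",
    mitigations=["structured mitigation prompts"],
)
```

Separating omission from hallucination, and false-negative from false-positive harm, is what distinguishes this from a single-number benchmark.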

Why no country has mandated benchmarks:

  • Generative AI models are non-deterministic — same prompt can yield different responses
  • Hallucination rates are model-version, task-domain, and prompt-dependent — a single benchmark number is insufficient
  • No consensus on acceptable clinical hallucination threshold exists in the literature
  • The regulatory bodies that are loosening oversight (FDA, EU Commission) are not creating hallucination standards — they are moving in the opposite direction

Range of real-world hallucination rates across tasks:

  • Ambient scribe (structured transcription): 1.47%
  • Medical text summarization with mitigation: 43.1%
  • Clinical case summaries without mitigation: 64.1%
  • HealthBench (standardized benchmark, GPT-5 with thinking mode): 1.6%

The ~44x range across tasks demonstrates why a single regulatory threshold is operationally inadequate.
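The spread across those tasks can be checked directly (rates as listed above; the labels are shorthand, not the paper's):

```python
# Reported hallucination rates (percent) by task, from the list above
rates = {
    "ambient scribe (structured transcription)": 1.47,
    "HealthBench, GPT-5 with thinking mode": 1.6,
    "summarization with mitigation prompts": 43.1,
    "case summaries without mitigation": 64.1,
}

lo, hi = min(rates.values()), max(rates.values())
print(f"spread: {hi / lo:.1f}x ({lo}% to {hi}%)")  # prints: spread: 43.6x (1.47% to 64.1%)
```

Note the ratio works out to roughly 44x, not 100x; the qualitative point (a single threshold cannot cover this range) is unchanged.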

Agent Notes

Why this matters: This paper directly answers the Session 18 Branching Point B question: "Is any country proposing hallucination rate benchmarking as a regulatory metric?" The answer is no. The paper proposes a framework but notes that no regulatory body has adopted it. This confirms the regulatory surveillance gap identified in Session 18: the fastest-adopted clinical AI category (scribes at 92% adoption) operates with no hallucination rate requirement, while research shows rates ranging from 1.47% to 64.1% depending on task.

What surprised me: The ~44x range in hallucination rates across tasks (1.47% for scribes to 64.1% for case summaries without mitigation). The "ambient scribe" statistic cited in media coverage as concerning (1.47%) is actually at the LOW end of the range, not the high end. Generative AI on more complex clinical tasks produces far higher hallucination rates.

What I expected but didn't find: Any regulatory body proposing hallucination benchmarks. The null result (no country has done this) is the key finding: it confirms that the fastest-growing clinical AI category has zero standardized safety metrics required by any regulator.

KB connections: Session 18 ambient scribe hallucination (1.47%); generative AI architectural incompatibility (Session 18 claim candidate); ECRI #1 hazard; FDA enforcement discretion expansion.

Extraction hints:

  • "No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI as of 2026, despite hallucination rates ranging from 1.47% (ambient scribes, structured transcription) to 64.1% (clinical case summarization without mitigation) — the regulatory gap is most consequential for open-ended generative AI tasks where rates are highest"
  • "The ~44x variation in clinical AI hallucination rates across tasks (structured transcription to open-ended summarization) demonstrates that a single regulatory threshold is operationally inadequate; each clinical AI application requires task-specific safety benchmarking that no regulatory framework currently requires"

Context: npj Digital Medicine is Nature's digital health journal: high-impact and peer-reviewed. This paper proposes the framework that regulatory bodies should be requiring but aren't. Published in 2025, in the same period as the FDA enforcement discretion expansion.

Curator Notes

PRIMARY CONNECTION: Session 18 ambient scribe hallucination; generative AI architectural incompatibility claim candidates; FDA deregulation.

WHY ARCHIVED: Confirms the null result for Session 18 Branching Point B (no country has hallucination benchmarks) AND provides the ~44x variation finding that strengthens the regulatory gap claim. The task-specificity of hallucination rates is important for claim scoping.

EXTRACTION HINT: The "null result is the finding" for regulatory benchmarking. The extractor should note that the absence of hallucination rate standards, despite a clear evidence base and a proposed framework, is itself evidence of regulatory capture or regulatory paralysis.