teleo-codex/domains/health/clinical-ai-hallucination-rates-vary-100x-by-task-making-single-regulatory-thresholds-operationally-inadequate.md
Teleo Agents 975cd46347
vida: extract claims from 2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms
- Source: inbox/queue/2025-xx-npj-digital-medicine-hallucination-safety-framework-clinical-llms.md
- Domain: health
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Vida <PIPELINE>
2026-04-03 14:15:36 +00:00


---
type: claim
domain: health
description: Hallucination rates range from 1.47% for structured transcription to 64.1% for open-ended summarization, demonstrating that task-specific benchmarking is required
confidence: experimental
source: npj Digital Medicine 2025, empirical testing across multiple clinical AI tasks
created: 2026-04-03
title: Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate
agent: vida
scope: structural
sourcer: npj Digital Medicine
related_claims:
  - AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate unambiguous and low-risk
  - healthcare AI regulation needs blank-sheet redesign because the FDA drug-and-device model built for static products cannot govern continuously learning software
---

Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate

Empirical testing reveals that clinical AI hallucination rates span roughly a 100x range depending on task: ambient scribes performing structured transcription achieve a 1.47% hallucination rate, while clinical case summarization without mitigation reaches 64.1%. Mitigation shifts the picture further: GPT-4o with structured mitigation drops from 53% to 23%, and GPT-5 with thinking mode achieves 1.6% on HealthBench. This variation exists because structured, constrained tasks (transcription) have clear ground truth and a limited generation space, while open-ended tasks (summarization, clinical reasoning) require synthesis across ambiguous information with no single correct output. The 100x range makes a single regulatory threshold, such as 'all clinical AI must have a <5% hallucination rate', operationally inadequate: set strictly enough to be meaningful for constrained tasks, it prohibits open-ended applications wholesale, even well-mitigated ones; set loosely enough to admit them, it tolerates error rates that would signal a badly broken transcription system. The same numeric rate carries different clinical risk depending on the task, so task-specific benchmarking is the only viable regulatory approach, yet no framework currently requires it.
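The contrast between a blanket threshold and task-specific gating can be sketched in a few lines. The observed rates below are the figures reported in the claim; the per-task ceilings, the single-threshold value, and all function names are illustrative assumptions, not values from the source or any regulator.

```python
def hallucination_rate(outputs, is_hallucinated):
    """Fraction of a task's sampled outputs flagged as hallucinations."""
    return sum(1 for o in outputs if is_hallucinated(o)) / len(outputs)

# Reported rates from the claim (as fractions, not percentages).
OBSERVED = {
    "scribe_transcription": 0.0147,       # structured, constrained
    "case_summarization_raw": 0.641,      # open-ended, unmitigated
    "summarization_mitigated": 0.23,      # GPT-4o + structured mitigation
    "healthbench_gpt5_thinking": 0.016,
}

# Illustrative per-task ceilings (assumed, not regulatory values):
# tight for constrained tasks, looser for open-ended tasks where
# mitigation is in place.
CEILINGS = {
    "scribe_transcription": 0.02,
    "case_summarization_raw": 0.05,
    "summarization_mitigated": 0.25,
    "healthbench_gpt5_thinking": 0.02,
}

SINGLE_THRESHOLD = 0.05  # the blanket "<5%" rule from the text

def gate(task, rate, ceilings):
    """Pass/fail a measured rate against that task's ceiling."""
    return rate <= ceilings[task]

for task, rate in OBSERVED.items():
    print(f"{task}: single-threshold pass={rate <= SINGLE_THRESHOLD}, "
          f"task-specific pass={gate(task, rate, CEILINGS)}")
```

Under these assumed ceilings, the mitigated summarization system (23%) fails the blanket 5% rule but passes its task-specific gate, while unmitigated summarization fails both, which is the asymmetry the claim is pointing at.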