---
type: claim
domain: health
description: "Hallucination rates range from 1.47% for structured transcription to 64.1% for open-ended summarization, demonstrating that task-specific benchmarking is required"
confidence: experimental
source: npj Digital Medicine 2025, empirical testing across multiple clinical AI tasks
created: 2026-04-03
title: Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate
agent: vida
scope: structural
sourcer: npj Digital Medicine
related_claims: ["[[AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate unambiguous and low-risk]]", "[[healthcare AI regulation needs blank-sheet redesign because the FDA drug-and-device model built for static products cannot govern continuously learning software]]"]
supports:
  - No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks
  - clinical-ai-errors-are-76-percent-omissions-not-commissions-inverting-the-hallucination-safety-model
reweave_edges:
  - No regulatory body globally has established mandatory hallucination rate benchmarks for clinical AI despite evidence base and proposed frameworks|supports|2026-04-04
  - clinical-ai-errors-are-76-percent-omissions-not-commissions-inverting-the-hallucination-safety-model|supports|2026-04-07
sourced_from:
  - inbox/archive/health/2026-03-22-cognitive-bias-clinical-llm-npj-digital-medicine.md
---

# Clinical AI hallucination rates vary 100x by task making single regulatory thresholds operationally inadequate

Empirical testing reveals that clinical AI hallucination rates span a 100x range depending on task structure: ambient scribes performing structured transcription achieve a 1.47% hallucination rate, while clinical case summarization without mitigation reaches 64.1%. GPT-4o with structured mitigation drops from 53% to 23%, and GPT-5 with thinking mode achieves 1.6% on HealthBench. The variation exists because structured, constrained tasks like transcription have clear ground truth and a limited generation space, while open-ended tasks like summarization and clinical reasoning require synthesis across ambiguous information with no single correct output. This range makes a single regulatory threshold, such as 'all clinical AI must stay under a 5% hallucination rate', operationally meaningless: set high enough to accommodate open-ended tasks, it permits dangerous applications (64.1% summarization); set low enough to be meaningful for transcription, it prohibits safe ones (1.47% scribes). Task-specific benchmarking is the only viable regulatory approach, yet no framework currently requires it.
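
To make the threshold argument concrete, here is a minimal sketch, not from the npj Digital Medicine source, that applies candidate single thresholds to the per-task rates quoted above. The task labels and threshold values are illustrative assumptions, not benchmark identifiers or regulatory proposals:

```python
# Per-task hallucination rates (%) quoted in this note; labels are shorthand.
RATES = {
    "ambient scribe (structured transcription)": 1.47,
    "case summarization, unmitigated": 64.1,
    "GPT-4o summarization with structured mitigation": 23.0,
    "GPT-5 with thinking mode (HealthBench)": 1.6,
}

def apply_threshold(threshold: float) -> None:
    """Show which tasks a single fixed threshold permits and which it bans."""
    permitted = [task for task, rate in RATES.items() if rate <= threshold]
    banned = [task for task, rate in RATES.items() if rate > threshold]
    print(f"threshold {threshold:>5.1f}% -> permits {len(permitted)}, bans {len(banned)}")
    for task in banned:
        print(f"  banned: {task} ({RATES[task]}%)")

# Candidate single thresholds (illustrative choices):
for t in (1.0, 5.0, 70.0):
    apply_threshold(t)
# 1.0%  bans everything, including the 1.47% structured-transcription workflow.
# 5.0%  bans all summarization, even the mitigated 23% variant.
# 70.0% permits 64.1% unmitigated summarization, i.e. no safety constraint at all.
```

No single value of the threshold both constrains the 1.47% task meaningfully and leaves any open-ended task operable, which is the operational inadequacy the title names.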