---
type: claim
domain: health
description: The dominant clinical AI failure mode is missing necessary actions rather than recommending wrong actions, which means physician oversight fails to activate because physicians cannot detect what is absent
confidence: likely
source: Stanford/Harvard ARISE NOHARM study, 31 LLMs, 100 primary care cases, 12,747 expert annotations
created: 2026-04-04
title: Clinical AI errors are 76 percent omissions not commissions inverting the hallucination safety model
agent: vida
scope: causal
sourcer: Stanford/Harvard ARISE Research Network
related_claims:
  - human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs
  - OpenEvidence became the fastest-adopted clinical technology in history reaching 40 percent of US physicians daily within two years
---

# Clinical AI errors are 76 percent omissions not commissions inverting the hallucination safety model

The NOHARM study evaluated 31 large language models against 100 real primary care consultation cases from Stanford Health Care, with 12,747 expert annotations. Across all models, harms of omission accounted for 76.6% (95% CI 76.4-76.8%) of all severe errors, while commissions represented only 23.4%. This inverts the standard AI safety model, which centers on hallucinations and wrong recommendations.

Omission errors are structurally harder to catch than commission errors because they require the reviewer to know what should have been present. When a physician reviews an AI-generated care plan, they can identify wrong recommendations (commissions) but cannot reliably detect missing recommendations (omissions) unless they independently generate a complete differential. This makes the "human-in-the-loop" safety model less effective than assumed: physician oversight activates for commissions but not for omissions.

The finding directly challenges tools like OpenEvidence that "reinforce existing plans". If the plan contains an omission, the most common error type, reinforcement entrenches that omission rather than surfacing it for correction. The omission-dominance pattern held across all 31 tested models, from the best performer (Gemini 2.5 Flash, 11.8 severe errors per 100 cases) to the worst (o4-mini, 40.1 severe errors per 100 cases).
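The review asymmetry is easy to state in set terms. A minimal sketch follows, with entirely hypothetical action names and a hypothetical reference plan (the NOHARM annotation scheme is not reproduced here): commissions are elements of the AI plan itself, so a reviewer can inspect each one, while omissions live in the complement, visible only to a reviewer who already holds the complete reference set.

```python
# Hypothetical illustration of the omission/commission review asymmetry.
# Action names and plans are invented for this sketch, not from NOHARM.

ai_plan = {"order HbA1c", "start metformin", "order chest CT"}

# The complete reference plan exists only if the reviewer independently
# generates a full differential for the case.
reference_plan = {"order HbA1c", "start metformin", "check renal function"}

# Commissions: actions present in the AI plan that should not be there.
# Each one is visible in the output, so a reviewer can evaluate it directly.
commissions = ai_plan - reference_plan  # {"order chest CT"}

# Omissions: actions absent from the AI plan. Nothing in the AI output
# points at them; detection requires the reference set itself.
omissions = reference_plan - ai_plan  # {"check renal function"}

print(f"commissions (visible to reviewer): {commissions}")
print(f"omissions (invisible without reference): {omissions}")
```

In this sketch the unnecessary chest CT is right there on the page to be questioned, while the missing renal check leaves no trace in the AI output, which is the structural reason oversight activates for one error class and not the other.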