- Source: inbox/queue/2026-03-22-stanford-harvard-noharm-clinical-llm-safety.md
- Domain: health
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)
| type | domain | description | confidence | source | created | title | agent | scope | sourcer | related_claims |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | health | The dominant clinical AI failure mode is missing necessary actions rather than recommending wrong ones, which means physician oversight fails to activate: physicians cannot detect what is absent | likely | Stanford/Harvard ARISE NOHARM study, 31 LLMs, 100 primary care cases, 12,747 expert annotations | 2026-04-04 | Clinical AI errors are 76% omissions, not commissions, inverting the hallucination safety model | vida | causal | Stanford/Harvard ARISE Research Network | |
# Clinical AI errors are 76% omissions, not commissions, inverting the hallucination safety model
The NOHARM study evaluated 31 large language models against 100 real primary care consultation cases from Stanford Health Care, with 12,747 expert annotations. Across all models, harms of omission accounted for 76.6% (95% CI 76.4-76.8%) of all severe errors, while commissions represented only 23.4%. This finding inverts the standard AI safety model, which focuses on hallucinations and wrong recommendations.

Omission errors are structurally harder to catch than commission errors because they require the reviewer to know what should have been present. When a physician reviews an AI-generated care plan, they can identify wrong recommendations (commissions) but cannot reliably detect missing recommendations (omissions) unless they independently generate a complete differential. This makes the 'human-in-the-loop' safety model less effective than assumed, because physician oversight activates for commissions but not omissions.

The finding directly challenges tools like OpenEvidence that 'reinforce existing plans' — if the plan contains an omission (the most common error type), reinforcement makes that omission more entrenched rather than surfacing it for correction. The omission-dominance pattern held across all 31 tested models, including the best performer (Gemini 2.5 Flash, at 11.8 severe errors per 100 cases) and the worst (o4 mini, at 40.1 severe errors per 100 cases).
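The headline statistic pools severe errors across all models. A minimal sketch of that pooling arithmetic, using hypothetical per-model tallies (not the NOHARM data — the real study aggregates 12,747 annotations across 31 models):

```python
# Hypothetical per-model tallies of severe errors, split by type.
# Model names and counts are illustrative, not from the study.
models = {
    "model_a": {"omissions": 90, "commissions": 28},
    "model_b": {"omissions": 310, "commissions": 95},
    "model_c": {"omissions": 150, "commissions": 46},
}

# Pool across models, then compute the omission share of all severe errors.
total_omissions = sum(m["omissions"] for m in models.values())
total_commissions = sum(m["commissions"] for m in models.values())
omission_share = total_omissions / (total_omissions + total_commissions)

print(f"Pooled omission share: {omission_share:.1%}")
```

The key point the arithmetic makes concrete: the share is computed over the pooled error pool, so omission dominance can hold overall even though individual models differ widely in their absolute error rates (11.8 versus 40.1 severe errors per 100 cases).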