- Source: inbox/queue/2026-03-22-stanford-harvard-noharm-clinical-llm-safety.md
- Domain: health
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)
| type | domain | description | confidence | source | created | title | agent | scope | sourcer | related_claims |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | health | The dominant clinical AI failure mode is missing necessary actions rather than recommending wrong ones, which means physician oversight fails to activate: physicians cannot detect what is absent | likely | Stanford/Harvard ARISE NOHARM study, 31 LLMs, 100 primary care cases, 12,747 expert annotations | 2026-04-04 | Clinical AI errors are 76% omissions, not commissions, inverting the hallucination safety model | vida | causal | Stanford/Harvard ARISE Research Network | |
# Clinical AI errors are 76% omissions, not commissions, inverting the hallucination safety model
The NOHARM study evaluated 31 large language models against 100 real primary care consultation cases from Stanford Health Care, with 12,747 expert annotations. Across all models, harms of omission accounted for 76.6% (95% CI 76.4-76.8%) of all severe errors, while commissions represented only 23.4%. This finding inverts the standard AI safety model, which focuses on hallucinations and wrong recommendations.

Omission errors are structurally harder to catch than commission errors because they require the reviewer to know what should have been present. When a physician reviews an AI-generated care plan, they can identify wrong recommendations (commissions) but cannot reliably detect missing recommendations (omissions) unless they independently generate a complete differential. This makes the 'human-in-the-loop' safety model less effective than assumed, because physician oversight activates for commissions but not omissions.

The finding directly challenges tools like OpenEvidence that 'reinforce existing plans' — if the plan contains an omission (the most common error type), reinforcement makes that omission more entrenched rather than surfacing it for correction. The omission-dominance pattern held across all 31 tested models, including the best performer (Gemini 2.5 Flash, at 11.8 severe errors per 100 cases) and the worst (o4 mini, at 40.1 severe errors per 100 cases).
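The headline statistic pools severe errors across all models. A minimal sketch of that pooling arithmetic, using hypothetical per-model tallies (not the NOHARM data — the real study aggregates 12,747 annotations across 31 models):

```python
# Hypothetical per-model tallies of severe errors, split by type.
# Model names and counts are illustrative, not from the study.
models = {
    "model_a": {"omissions": 90, "commissions": 28},
    "model_b": {"omissions": 310, "commissions": 95},
    "model_c": {"omissions": 150, "commissions": 46},
}

# Pool across models, then compute the omission share of all severe errors.
total_omissions = sum(m["omissions"] for m in models.values())
total_commissions = sum(m["commissions"] for m in models.values())
omission_share = total_omissions / (total_omissions + total_commissions)

print(f"Pooled omission share: {omission_share:.1%}")
```

The key point the arithmetic makes concrete: the share is computed over the pooled error pool, so omission dominance can hold overall even though individual models differ widely in their absolute error rates (11.8 versus 40.1 severe errors per 100 cases).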