vida: extract claims from 2026-03-22-stanford-harvard-noharm-clinical-llm-safety

- Source: inbox/queue/2026-03-22-stanford-harvard-noharm-clinical-llm-safety.md
- Domain: health
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Vida <PIPELINE>
Teleo Agents 2026-04-04 14:09:01 +00:00
parent 2b4392c8de
commit 92c1b5907c
2 changed files with 34 additions and 0 deletions

@@ -0,0 +1,17 @@
---
type: claim
domain: health
description: The dominant clinical AI failure mode is missing necessary actions rather than recommending wrong actions, which means physician oversight fails to activate because physicians cannot detect what is absent
confidence: likely
source: Stanford/Harvard ARISE NOHARM study, 31 LLMs, 100 primary care cases, 12,747 expert annotations
created: 2026-04-04
title: Clinical AI errors are 76 percent omissions not commissions inverting the hallucination safety model
agent: vida
scope: causal
sourcer: Stanford/Harvard ARISE Research Network
related_claims: ["[[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]]", "[[OpenEvidence became the fastest-adopted clinical technology in history reaching 40 percent of US physicians daily within two years]]"]
---
# Clinical AI errors are 76 percent omissions not commissions inverting the hallucination safety model
The NOHARM study evaluated 31 large language models against 100 real primary care consultation cases from Stanford Health Care, with 12,747 expert annotations. Across all models, harms of omission accounted for 76.6% (95% CI 76.4-76.8%) of all severe errors, while commissions represented only 23.4%. This finding inverts the standard AI safety model, which focuses on hallucinations and wrong recommendations. Omission errors are structurally harder to catch than commission errors because they require the reviewer to know what should have been present. When a physician reviews an AI-generated care plan, they can identify wrong recommendations (commissions) but cannot reliably detect missing recommendations (omissions) unless they independently generate a complete differential. This makes the 'human-in-the-loop' safety model less effective than assumed, because physician oversight activates for commissions but not for omissions. The finding directly challenges tools like OpenEvidence that 'reinforce existing plans': if the plan contains an omission (the most common error type), reinforcement entrenches that omission rather than surfacing it for correction. The omission-dominance pattern held across all 31 tested models, from the best performer (Gemini 2.5 Flash, 11.8 severe errors per 100 cases) to the worst (o4 mini, 40.1 severe errors per 100 cases).
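
To make the arithmetic behind these figures concrete, here is a minimal sketch; the annotation counts are hypothetical placeholders chosen so the resulting share matches the percentage cited above, not the study's raw data:

```python
# Minimal sketch with hypothetical counts; only the shape of the
# calculation mirrors the study, not its raw annotation data.
severe_omissions = 766    # severe errors annotated as harms of omission (hypothetical)
severe_commissions = 234  # severe errors annotated as harms of commission (hypothetical)

total_severe = severe_omissions + severe_commissions
omission_share = severe_omissions / total_severe
print(f"omission share of severe errors: {omission_share:.1%}")  # 76.6%

# Per-model severe-error rate, expressed per 100 cases as in the study.
cases_reviewed = 100
severe_errors_for_model = 12  # hypothetical count for one model
rate_per_100 = severe_errors_for_model / cases_reviewed * 100
print(f"severe errors per 100 cases: {rate_per_100:.1f}")
```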

@@ -0,0 +1,17 @@
---
type: claim
domain: health
description: AI performance on medical knowledge exams like USMLE shows only moderate correlation with actual clinical safety outcomes, challenging the use of benchmark scores as safety evidence
confidence: likely
source: Stanford/Harvard ARISE NOHARM study, correlation analysis across 31 LLMs
created: 2026-04-04
title: Medical benchmark performance does not predict clinical safety as USMLE scores correlate only 0.61 with harm rates
agent: vida
scope: correlational
sourcer: Stanford/Harvard ARISE Research Network
related_claims: ["[[medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials]]"]
---
# Medical benchmark performance does not predict clinical safety as USMLE scores correlate only 0.61 with harm rates
The NOHARM study found that safety performance (measured as severe harm rate across 100 real clinical cases) correlated only moderately with existing AI and medical benchmarks, at r = 0.61-0.64. Because variance explained is the square of the correlation, a model's USMLE score or performance on other medical knowledge tests explains only 37-41% of the variance in clinical safety outcomes. The finding challenges the widespread practice of using benchmark performance as evidence of clinical safety, a practice employed by companies like OpenEvidence, which markets its 100% USMLE score as a safety credential. The gap exists because medical exams test knowledge recall and reasoning on well-formed questions with clear answers, while clinical safety requires completeness (not missing necessary actions), appropriate risk stratification, and handling of ambiguous real-world presentations. A model can score perfectly on the USMLE by correctly answering the questions asked while still producing high omission rates, because it fails to consider diagnoses or management options it was not explicitly prompted about. The study tested 31 models spanning the performance spectrum, with the best performers (Gemini 2.5 Flash, LiSA 1.0) at 11.8-14.6 severe errors per 100 cases and the worst (o4 mini, GPT-4o mini) at 39.9-40.1 severe errors per 100 cases, a range that existing benchmarks fail to predict reliably.
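
The variance-explained figures follow from squaring the reported correlations; a quick check of that arithmetic, using only the r values quoted above:

```python
# Variance explained by a benchmark score is the square of its
# correlation (r) with the safety outcome.
for r in (0.61, 0.64):
    print(f"r = {r:.2f} -> r^2 = {r**2:.2f} ({r**2:.0%} of variance explained)")
# r = 0.61 -> r^2 = 0.37 (37% of variance explained)
# r = 0.64 -> r^2 = 0.41 (41% of variance explained)
```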