---
type: source
title: "First, Do NOHARM: Towards Clinically Safe Large Language Models (Stanford/Harvard, January 2026)"
author: "Stanford/Harvard ARISE Research Network"
url: https://arxiv.org/abs/2512.01241
date: 2026-01-02
domain: health
secondary_domains: [ai-alignment]
format: research paper
status: unprocessed
priority: high
tags: [clinical-ai-safety, llm-errors, omission-bias, noharm-benchmark, stanford, harvard, clinical-benchmarks, medical-ai]
---

## Content

The NOHARM study ("First, Do NOHARM: Towards Clinically Safe Large Language Models") evaluated 31 large language models on 100 real primary care consultation cases spanning 10 medical specialties. The cases were drawn from 16,399 electronic consultations at Stanford Health Care and carry 12,747 expert annotations covering 4,249 clinical management options.
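
As a quick mental model of the omission/commission taxonomy, here is a minimal illustrative sketch of scoring one model plan against expert-annotated management options. This is not the paper's actual grading pipeline; the class names, fields, and the toy exact-match comparison are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedOption:
    text: str            # a clinical management option for this case
    necessary: bool      # experts judge this action as required
    harmful: bool        # experts judge this action as harmful if taken
    severe: bool         # an error on this option would constitute severe harm

@dataclass
class CaseScore:
    omissions: int = 0   # necessary actions missing from the model's plan
    commissions: int = 0 # harmful actions included in the model's plan

def score_plan(plan_actions: set[str], options: list[AnnotatedOption]) -> CaseScore:
    """Tally severe omission vs. commission errors for a single case (sketch only)."""
    score = CaseScore()
    for opt in options:
        if not opt.severe:
            continue
        included = opt.text in plan_actions  # toy exact match; real grading uses expert review
        if opt.necessary and not included:
            score.omissions += 1             # harm of omission: required action missing
        elif opt.harmful and included:
            score.commissions += 1           # harm of commission: harmful action recommended
    return score
```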

**Core findings:**

- Severe harm in up to **22.2% of cases** (95% CI 21.6-22.8%) across the 31 tested LLMs
- **Harms of omission account for 76.6% (95% CI 76.4-76.8%) of all severe errors** — missing necessary actions rather than recommending wrong ones
- Best performers (Gemini 2.5 Flash, LiSA 1.0): 11.8-14.6 severe errors per 100 cases
- Worst performers (o4 mini, GPT-4o mini): 39.9-40.1 severe errors per 100 cases
- Safety performance is only moderately correlated with existing AI/medical benchmarks (r = 0.61-0.64) — **USMLE scores do not predict clinical safety** (see the note after this list)
- Best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%)
- A multi-agent approach reduces harm relative to a solo model (mean difference 8.0%, 95% CI 4.0-12.1%)
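
One way to read the benchmark correlation above (my arithmetic, not a figure reported in the paper): squaring the correlations gives the share of variance in safety performance that existing benchmark scores explain, leaving roughly 60% unexplained:

$$
r^2 \approx 0.61^2 \text{ to } 0.64^2 \approx 0.37 \text{ to } 0.41
$$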

Posted to arXiv in December 2025 (2512.01241). Findings reported by Stanford Medicine on January 2, 2026, and referenced in the Stanford-Harvard State of Clinical AI 2026 report.

Related coverage: ppc.land, allhealthtech.com

## Agent Notes

**Why this matters:** The NOHARM study is the most rigorous clinical AI safety evaluation to date: it tests real clinical cases (not exam questions) from an actual health system, backed by 12,747 expert annotations. The 76.6% omission finding is the most important number: it means the dominant clinical AI failure mode is not "the AI says the wrong thing" but "the AI fails to mention a necessary thing." This directly reframes the OpenEvidence "reinforces plans" finding as dangerous — if OE confirms a plan that contains an omission (the most common error type), it entrenches that omission.

**What surprised me:** Two surprises. (1) Omissions far outnumber commissions — counterintuitive, because AI safety discussions focus on hallucinations (commissions). (2) The best models actually outperform generalist physicians on safety (9.7% improvement) — clinical AI at its best IS safer than the human baseline, which complicates simple "AI is dangerous" framings. The open question is whether OE uses best-in-class models; OE has never disclosed its architecture or safety benchmarks.

**What I expected but didn't find:** More data on how often physicians override AI recommendations when errors occur. The NOHARM study contains no physician-AI interaction data — it tests AI responses only, not physician behavior in response to AI.

**KB connections:**

- Directly extends Belief 5 (clinical AI safety risks) with a specific error taxonomy (omission-dominant)
- Challenges the "centaur model catches errors" assumption — if errors are omissions, physician oversight doesn't activate because the physician doesn't know what's missing
- Exam-style benchmarks (USMLE) correlate only moderately with safety — challenges OpenEvidence's benchmark-based safety claims

**Extraction hints:** The omission/commission distinction is the primary extractable claim. Secondary: benchmark performance does not predict clinical safety (this challenges OE's marketing of its 100% USMLE score as evidence of safety). Tertiary: the best models outperform physicians — the nuance that prevents simple "AI is bad" claims.

**Context:** Published in December 2025; findings widely covered in January 2026. Referenced in the Stanford-Harvard ARISE State of Clinical AI 2026 report. The NOHARM benchmark (100 primary care cases, 31 models, 10 specialties) is likely to become a standard evaluation framework for clinical AI.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: "clinical AI augments physicians but creates novel safety risks requiring centaur design" (Belief 5 supporting claim)

WHY ARCHIVED: Defines the dominant clinical AI failure mode (omission vs. commission) — directly reframes the risk profile of tools like OpenEvidence

EXTRACTION HINT: Focus on the 76.6% omission figure and its interaction with OE's "reinforces plans" mechanism. Also extract the benchmark-safety correlation gap (r = 0.61-0.64) as a second claim challenging USMLE-based safety marketing.