extract: 2026-03-22-stanford-harvard-noharm-clinical-llm-safety #1629

Closed
leo wants to merge 1 commit from extract/2026-03-22-stanford-harvard-noharm-clinical-llm-safety into main
5 changed files with 71 additions and 1 deletion

View file

@@ -41,6 +41,12 @@ OpenEvidence reached 30M+ monthly consultations by March 2026, including a histo
ARISE report reframes OpenEvidence adoption as shadow-IT workaround behavior rather than validation of clinical value. Clinicians use OE to 'bypass slow internal IT systems' because institutional tools are too slow for clinical workflows. This suggests rapid adoption reflects institutional system failure, not OE's clinical superiority.
### Additional Evidence (challenge)
*Source: [[2026-03-22-stanford-harvard-noharm-clinical-llm-safety]] | Added: 2026-03-22*
The NOHARM study found that 76.6% of clinical AI errors are omissions (failing to recommend necessary actions). This directly challenges OpenEvidence's 'reinforces existing clinical plans' value proposition — if a physician's plan contains an omission (the most common error type), AI confirmation entrenches that omission rather than catching it. The tool's core mechanism may amplify the dominant failure mode rather than mitigate it.
Relevant Notes:

View file

@@ -33,6 +33,12 @@ OpenEvidence's 1M daily consultations (30M+/month) with 44% of physicians expres
---
### Additional Evidence (extend)
*Source: [[2026-03-22-stanford-harvard-noharm-clinical-llm-safety]] | Added: 2026-03-22*
The NOHARM study provides a mechanism for why human-in-the-loop fails: 76.6% of clinical AI errors are omissions (missing necessary actions) rather than commissions (wrong actions). Physicians cannot catch omission errors because there's no visible mistake to override — they don't know what's missing. This means the oversight mechanism only activates for 23.4% of errors (commissions) while the dominant error type (76.6% omissions) passes through undetected.
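A back-of-envelope sketch of that arithmetic (the reviewer catch rates are illustrative assumptions, not figures from the study): if omissions are effectively invisible to the reviewer, the omission share puts a hard floor under the fraction of errors that survive oversight.

```python
# Sketch with assumed catch rates, not study data: oversight can only
# intercept visible commission errors, so omissions pass through untouched.

OMISSION_SHARE = 0.766            # NOHARM: share of AI errors that are omissions
COMMISSION_SHARE = 1 - OMISSION_SHARE

def residual_error_share(commission_catch_rate: float) -> float:
    """Fraction of AI errors surviving human review, assuming the reviewer
    catches omissions at a rate of ~0 (no visible mistake to flag)."""
    return 1 - COMMISSION_SHARE * commission_catch_rate

print(round(residual_error_share(1.0), 3))   # 0.766 -- even a perfect commission-spotter
print(round(residual_error_share(0.5), 3))   # 0.883 -- a realistic reviewer does worse
```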
Relevant Notes:
- [[centaur team performance depends on role complementarity not mere human-AI combination]] -- the chess centaur model does NOT generalize to clinical medicine where physician overrides degrade AI performance
- [[medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials]] -- the multi-hospital RCT found similar diagnostic accuracy with/without AI; the Stanford/Harvard study found AI alone dramatically superior

View file

@@ -35,6 +35,12 @@ OpenEvidence's medRxiv preprint (November 2025) showed 24% accuracy for relevant
ARISE report identifies specific failure modes: real-world performance 'breaks down when systems must manage uncertainty, incomplete information, or multi-step workflows.' This provides mechanistic detail for why benchmark performance doesn't translate — benchmarks test pattern recognition on complete data while clinical care requires uncertainty management.
### Additional Evidence (confirm)
*Source: [[2026-03-22-stanford-harvard-noharm-clinical-llm-safety]] | Added: 2026-03-22*
The NOHARM study found only moderate correlation (r=0.61-0.64) between USMLE/benchmark performance and actual clinical safety outcomes. Models scoring 100% on USMLE still produced severe harm in 11.8-14.6% of real cases. This confirms that benchmark performance is a weak predictor of clinical utility and extends the finding from diagnostic accuracy to safety outcomes.
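A standard variance-explained reading of those correlations (my calculation, not a figure reported in the note): squaring r shows benchmark scores account for well under half the variation in safety outcomes.

```python
# Variance in harm rates explained by benchmark scores at the reported correlations.
for r in (0.61, 0.64):
    print(f"r = {r}: R^2 = {r * r:.2f}, unexplained = {1 - r * r:.2f}")
# r = 0.61: R^2 = 0.37, unexplained = 0.63
# r = 0.64: R^2 = 0.41, unexplained = 0.59
```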
Relevant Notes:

View file

@@ -0,0 +1,36 @@
{
"rejected_claims": [
{
"filename": "clinical-ai-errors-are-76-percent-omissions-not-commissions-making-oversight-ineffective.md",
"issues": [
"missing_attribution_extractor"
]
},
{
"filename": "medical-benchmark-performance-does-not-predict-clinical-safety-with-only-0-61-correlation-between-usmle-scores-and-harm-rates.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 2,
"kept": 0,
"fixed": 6,
"rejected": 2,
"fixes_applied": [
"clinical-ai-errors-are-76-percent-omissions-not-commissions-making-oversight-ineffective.md:set_created:2026-03-22",
"clinical-ai-errors-are-76-percent-omissions-not-commissions-making-oversight-ineffective.md:stripped_wiki_link:human-in-the-loop clinical AI degrades to worse-than-AI-alon",
"clinical-ai-errors-are-76-percent-omissions-not-commissions-making-oversight-ineffective.md:stripped_wiki_link:OpenEvidence became the fastest-adopted clinical technology ",
"medical-benchmark-performance-does-not-predict-clinical-safety-with-only-0-61-correlation-between-usmle-scores-and-harm-rates.md:set_created:2026-03-22",
"medical-benchmark-performance-does-not-predict-clinical-safety-with-only-0-61-correlation-between-usmle-scores-and-harm-rates.md:stripped_wiki_link:medical LLM benchmark performance does not translate to clin",
"medical-benchmark-performance-does-not-predict-clinical-safety-with-only-0-61-correlation-between-usmle-scores-and-harm-rates.md:stripped_wiki_link:healthcare AI regulation needs blank-sheet redesign because "
],
"rejections": [
"clinical-ai-errors-are-76-percent-omissions-not-commissions-making-oversight-ineffective.md:missing_attribution_extractor",
"medical-benchmark-performance-does-not-predict-clinical-safety-with-only-0-61-correlation-between-usmle-scores-and-harm-rates.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-22"
}

View file

@@ -7,9 +7,13 @@ date: 2026-01-02
domain: health
secondary_domains: [ai-alignment]
format: research paper
-status: unprocessed
+status: enrichment
priority: high
tags: [clinical-ai-safety, llm-errors, omission-bias, noharm-benchmark, stanford, harvard, clinical-benchmarks, medical-ai]
processed_by: vida
processed_date: 2026-03-22
enrichments_applied: ["human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md", "OpenEvidence became the fastest-adopted clinical technology in history reaching 40 percent of US physicians daily within two years.md", "medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -49,3 +53,15 @@ Related coverage: ppc.land, allhealthtech.com
PRIMARY CONNECTION: "clinical AI augments physicians but creates novel safety risks requiring centaur design" (Belief 5 supporting claim)
WHY ARCHIVED: Defines the dominant clinical AI failure mode (omission vs. commission) — directly reframes the risk profile of tools like OpenEvidence
EXTRACTION HINT: Focus on the 76.6% omission figure and its interaction with OE's "reinforces plans" mechanism. Also extract the benchmark-safety correlation gap (r=0.61) as a second claim challenging USMLE-based safety marketing.
## Key Facts
- NOHARM study tested 31 LLMs on 100 primary care cases from Stanford Health Care
- Study included 12,747 expert annotations for 4,249 clinical management options
- Cases drawn from 16,399 real electronic consultations at Stanford Health Care
- Study published to arXiv in December 2025 (2512.01241)
- Findings reported by Stanford Medicine January 2, 2026
- Best performers: Gemini 2.5 Flash, LiSA 1.0 at 11.8-14.6 severe errors per 100 cases
- Worst performers: o4 mini, GPT-4o mini at 39.9-40.1 severe errors per 100 cases
- Multi-agent approach reduces harm by mean difference 8.0% (95% CI 4.0-12.1%); a sketch of the generic pattern follows this list
- Best models outperform generalist physicians by mean difference 9.7% (95% CI 7.0-12.5%)
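The note does not describe how the study's multi-agent approach is wired; a minimal sketch of the generic generate-then-audit pattern, with a hypothetical `llm_call` client standing in for any real model API, illustrates why a second pass aimed specifically at omissions can catch what a human reviewer cannot.

```python
# Hypothetical generate-then-audit loop; the NOHARM paper's actual
# multi-agent architecture may differ. `llm_call` is an assumed stand-in
# for a chat-completion client, not a real API.

def llm_call(role: str, prompt: str) -> str:
    """Placeholder: wire up a real model client here."""
    raise NotImplementedError

def manage_case(case: str) -> str:
    plan = llm_call("planner", f"Propose a management plan for:\n{case}")
    # The auditor is prompted specifically for omissions -- the error type
    # (76.6% of the total) that a human overseer has no visible cue to catch.
    gaps = llm_call("auditor",
                    f"Case:\n{case}\nPlan:\n{plan}\n"
                    "List any necessary actions the plan omits, or reply NONE.")
    if gaps.strip().upper() != "NONE":
        plan = llm_call("planner",
                        f"Revise the plan to cover these omissions:\n{gaps}\n\n"
                        f"Original plan:\n{plan}")
    return plan
```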