| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | State of Clinical AI Report 2026 (ARISE Network, Stanford-Harvard) | ARISE Network — Peter Brodeur MD, Ethan Goh MD, Adam Rodman MD, Jonathan Chen MD PhD | https://arise-ai.org/report | 2026-01-01 | health | | report | unprocessed | high | |
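The metadata row above follows a fixed column order. A minimal sketch of how an ingestion script might map such a row into a record (the field names come from the header row; the function and the shortened sample row are illustrative assumptions, not part of any real pipeline):

```python
# Sketch: parse a one-row markdown metadata table into a dict.
# FIELDS mirrors the header row of the table above; all else is illustrative.
FIELDS = ["type", "title", "author", "url", "date", "domain",
          "secondary_domains", "format", "status", "priority", "tags"]

def parse_metadata_row(row: str) -> dict:
    """Split a markdown table row on '|' and zip with the known header."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    # Pad short rows so every field is present, even if empty.
    cells += [""] * (len(FIELDS) - len(cells))
    return dict(zip(FIELDS, cells))

record = parse_metadata_row(
    "| source | State of Clinical AI Report 2026 (ARISE Network, Stanford-Harvard) "
    "| ARISE Network | https://arise-ai.org/report | 2026-01-01 | health "
    "| | report | unprocessed | high | |"
)
```

Empty cells (here `secondary_domains` and `tags`) survive as empty strings, so downstream code can rely on every key being present.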
Content
The State of Clinical AI (2026) was released in January 2026 by the ARISE Network, a Stanford-Harvard research collaboration. This inaugural report synthesizes evidence on how clinical AI performs in real-world settings versus controlled benchmarks.
Key findings:
Benchmark vs. real-world gap:
- LLMs demonstrate strong performance on diagnostic benchmarks and structured clinical cases
- Real-world performance "breaks down when systems must manage uncertainty, incomplete information, or multi-step workflows" — which describes everyday clinical care
- The evidence base itself is described as patchy: "real-world care remains uneven"
The "Safety Paradox" (novel framing):
- Clinicians turn to "nimble, consumer-facing medical search engines" (specifically citing OpenEvidence) to check drug interactions and summarize patient histories, "often bypassing slow internal IT systems"
- This represents a safety paradox: clinicians prioritize speed over compliance because institutional AI tools are too slow for clinical workflows
- OpenEvidence (OE) adoption is explicitly characterized as shadow-IT workaround behavior that has become normalized
Evaluation framework:
- The report argues current evaluation focuses on "engagement rather than outcomes"
- Calls for "clearer evidence, stronger escalation pathways, and evaluation frameworks that focus on outcomes rather than engagement alone"
OpenEvidence is specifically named as a case study of consumer-facing medical AI being used to bypass institutional oversight.
Additional coverage: Stanford Department of Medicine news release, BABL AI, Harvard Science Review ("Beyond the Hype: The First Real Audit of Clinical AI," February 2026), Stanford HAI.
Agent Notes
Why this matters: The ARISE report is the first systematic, peer-network-authored overview of clinical AI's real-world state. Its framing of OE as "shadow IT" is significant — it recharacterizes OE's rapid adoption not as a sign of clinical value, but as clinicians working around institutional barriers. This frames the OE-Sutter Epic integration as moving from "shadow IT" to "officially sanctioned shadow IT" — the speed that made OE attractive is now institutionally embedded without resolving the governance gap.
What surprised me: The explicit naming of OpenEvidence as a case study in the safety paradox. This is the first time a Stanford-affiliated academic review has characterized OE adoption as a workaround behavior rather than evidence of clinical value. At $12B valuation and 30M+ consultations/month, this framing matters for how OE's safety profile is evaluated.
What I expected but didn't find: Specific outcome data for any clinical AI tool. The report explicitly identifies this as the field's core gap — the absence of outcomes data is the finding, not an absence of coverage.
KB connections:
- Directly extends Session 9 finding on the valuation-evidence asymmetry (OE at $12B, one retrospective 5-case study)
- The "safety paradox" framing provides vocabulary for why OE's governance gap is structural, not accidental
- Connects to the Sutter Health EHR integration (February 2026) — embedding OE in Epic formally addresses the speed problem while potentially entrenching the governance gap
Extraction hints: Extract the "safety paradox" framing as a named mechanism: clinicians bypassing institutional AI governance to use consumer-facing tools because institutional tools are too slow. This is generalizable beyond OE. Secondary: extract the benchmark-vs-real-world gap finding as it applies to clinical AI at scale.
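The two extraction targets above can be sketched as structured claim records, one per mechanism. The `Claim` shape and its field names are illustrative assumptions for this note, not an actual extractor schema:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """Illustrative shape for an extracted KB claim (not a real schema)."""
    name: str
    statement: str
    evidence: list[str] = field(default_factory=list)
    scope: str = "clinical AI"  # how far the mechanism generalizes

# Primary target: the "safety paradox" as a named, generalizable mechanism.
safety_paradox = Claim(
    name="safety paradox",
    statement=("Clinicians bypass institutional AI governance in favor of "
               "consumer-facing tools because institutional tools are too "
               "slow for clinical workflows."),
    evidence=["OpenEvidence adoption as normalized shadow-IT behavior"],
    scope="generalizable beyond OpenEvidence",
)

# Secondary target: the benchmark-vs-real-world gap, kept as a separate claim.
benchmark_gap = Claim(
    name="benchmark vs. real-world gap",
    statement=("Strong LLM benchmark performance breaks down under "
               "uncertainty, incomplete information, and multi-step "
               "clinical workflows."),
    evidence=["real-world performance breakdown described in the report"],
)
```

Keeping the two claims as separate records preserves the hint above: the mechanism claim and the performance-gap claim can then be argued, sourced, and revised independently.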
Context: The ARISE network is the most credible academic voice on clinical AI evaluation practices. The report's release in January 2026 — coinciding with the NOHARM study findings — represents a coordinated moment of academic accountability for a rapidly scaling industry. The Harvard Science Review calling it "the first real audit" signals its significance in the field.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: "medical LLM benchmarks don't translate to clinical impact" (existing KB claim)
WHY ARCHIVED: Provides the first systematic framework for understanding clinical AI real-world performance gaps; introduces the "safety paradox" framing for consumer AI workaround behavior.
EXTRACTION HINT: The "safety paradox" is a novel mechanism claim; extract it separately from the benchmark-gap finding. Both have evidence (OE adoption behavior, real-world performance breakdown) and are specific enough to be arguable.