--- status: seed type: musing stage: developing created: 2026-03-22 last_updated: 2026-03-22 tags: [clinical-ai-safety, openevidence, automation-bias, sociodemographic-bias, noharm, llm-errors, sutter-health, semaglutide-canada, health-canada-rejection, obbba-work-requirements, belief-5-disconfirmation] --- # Research Session: Clinical AI Safety Mechanism — Reinforcement or Bias Amplification? ## Research Question **Is the clinical AI safety concern for tools like OpenEvidence primarily about automation bias/de-skilling (changing wrong decisions), or about systematic bias amplification (reinforcing existing physician biases and plan omissions at population scale)? What does the 2025-2026 evidence base on LLM systematic bias and clinical safety say about the predominant failure mode?** ## Why This Question **Session 9 (March 21) opened Direction B as the highest KB value thread:** The "OE reinforces existing plans" PMC finding (not changing decisions) appeared to WEAKEN the deskilling/automation-bias mechanism originally in Belief 5. But I flagged the alternative: if OE reinforces plans that already contain systematic biases or omissions, the safety concern shifts to population-scale amplification of existing errors. Direction B is more dangerous because it's invisible — physicians remain "competent" but systematically biased and overconfident in reinforced plans. **Keystone belief disconfirmation target — Session 10 (Belief 5):** The claim: "Clinical AI augments physicians but creates novel safety risks requiring centaur design." Session 9 complicated this by suggesting OE doesn't change decisions, weakening the known automation-bias mechanism. **What would disconfirm Belief 5's safety concern:** - Evidence that LLM clinical recommendations have minimal systematic bias (unbiased reinforcement = net positive) - Evidence that OE-type tools surface omissions and concerns that physicians miss (additive rather than confirmatory) - Evidence that physicians actively override or critically evaluate AI recommendations (automation bias minimal in practice) **What would strengthen Direction B (reinforcement-as-amplification):** - Evidence that LLMs have systematic sociodemographic biases in clinical recommendations (if OE reinforces these, it amplifies them) - Evidence that most LLM errors are omissions rather than commissions (OE confirming plans = confirming plans with omissions) - Evidence that physicians develop automation bias toward AI suggestions even when trained otherwise ## What I Found ### Core Finding 1: NOHARM Study — LLMs Make Severe Errors in 22% of Clinical Cases, 76.6% Are Omissions The Stanford/Harvard NOHARM study ("First, Do NOHARM: Towards Clinically Safe Large Language Models," arxiv 2512.01241, findings released January 2, 2026) is the most rigorous clinical AI safety evaluation to date: - 31 LLMs tested on 100 real primary care consultation cases, 10 specialties - Cases drawn from 16,399 real electronic consultations at Stanford Health Care - 12,747 expert annotations for 4,249 clinical management options - **Severe harm in up to 22.2% of cases (95% CI 21.6-22.8%)** - **Harms of OMISSION account for 76.6% of all errors** — not commissions (wrong action), but missing necessary actions - Best models (Gemini 2.5 Flash, LiSA 1.0): 11.8-14.6 severe errors per 100 cases - Worst models (o4 mini, GPT-4o mini): 39.9-40.1 severe errors per 100 cases - Safety performance ONLY MODERATELY correlated with AI benchmarks (r = 0.61-0.64) — USMLE scores don't predict clinical safety - HOWEVER: Best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%) - Multi-agent approach reduces harm vs. solo model (mean difference 8.0%, 95% CI 4.0-12.1%) **Critical connection to OE "reinforces plans" finding:** The dominant error type (76.6% omissions) DIRECTLY EXPLAINS why "reinforcement" is dangerous. If OE confirms a physician's plan that has an omission (the most common error), OE's confirmation makes the physician MORE confident in an incomplete plan. This is not "OE causes wrong actions" — it's "OE prevents the physician from recognizing what they missed." At 30M+ monthly consultations, this operates at population scale. ### Core Finding 2: Nature Medicine Sociodemographic Bias Study — Systematic Demographic Bias in All Clinical LLMs Published in Nature Medicine (2025, doi: 10.1038/s41591-025-03626-6), PubMed 40195448: - 9 LLMs evaluated, 1.7 million model-generated outputs - 1,000 ED cases (500 real, 500 synthetic) presented in 32 sociodemographic variations - Clinical details held constant — only demographic labels changed **Findings:** - Black, unhoused, LGBTQIA+ patients: more frequently directed to urgent care, invasive interventions, mental health evaluations - LGBTQIA+ subgroups: mental health assessments recommended **6-7x more often than clinically indicated** - High-income patients: significantly more advanced imaging (CT/MRI, P < 0.001) - Low/middle-income patients: limited to basic or no further testing - Bias found in BOTH proprietary AND open-source models **The "not supported by clinical reasoning or guidelines" qualifier is key:** These biases are not acceptable clinical variation — they are model-driven artifacts. They would propagate if a tool like OE "reinforces" physician plans in these demographic contexts. **Combined with NOHARM:** If OE is built on models with systematic sociodemographic biases, AND OE "reinforces" physician plans, AND physician plans are subject to the same demographic biases (physicians also show these patterns in the literature), then OE amplifies demographic bias at population scale rather than correcting it. ### Core Finding 3: Automation Bias RCT — Even AI-Trained Physicians Defer to Erroneous AI Registered clinical trial (NCT06963957), published medRxiv August 26, 2025: - Pakistan RCT (June 20-August 15, 2025), physicians from multiple institutions - All participants had completed 20-hour AI-literacy training (critical evaluation of AI output) - Randomized 1:1: control arm received correct ChatGPT-4o recommendations; treatment arm received recommendations with deliberate errors in 3 of 6 vignettes - **Result: erroneous LLM recommendations significantly degraded diagnostic performance even in AI-trained physicians** - "Voluntary deference to flawed AI output highlights critical patient safety risk" **This directly challenges the "centaur design will solve it" assumption in Belief 5.** If 20 hours of AI literacy training is insufficient to protect physicians from automation bias, the centaur model's "physician for judgment" component is more vulnerable than assumed. The physicians most likely to use OE are exactly those most likely to trust it. Related: JAMA Network Open "LLM Influence on Diagnostic Reasoning" randomized clinical trial (June 2025) — same pattern emerging across multiple experimental designs. ### Core Finding 4: Stanford-Harvard State of Clinical AI 2026 (ARISE Network) The ARISE network (Stanford-Harvard) released the "State of Clinical AI 2026" in January/February 2026: - Explicitly distinguishes "benchmark performance" from "real-world clinical performance" — the gap is large - LLMs break down for "uncertainty, incomplete information, or multi-step workflows" — everyday clinical conditions - **"Safety paradox":** Clinicians use consumer-facing tools like OE to bypass slow institutional IT governance, prioritizing speed over compliance/oversight - Evaluation frameworks must "focus on outcomes rather than engagement" - OE specifically cited as a "consumer-facing medical search engine" used to "bypass slow internal IT systems" The "safety paradox" is a new framing: the features that make OE attractive (speed, external access, consumer-grade UX) are EXACTLY the features that create governance gaps. OE adoption is driven by work-around behavior, not institutional validation. ### Core Finding 5: OpenEvidence + Sutter Health Epic EHR Integration (February 11, 2026) Announced February 11, 2026: OE is now embedded within Epic EHR workflows at Sutter Health (one of California's largest health systems, ~12,000 physicians): - Natural-language search for guidelines, studies, clinical evidence — directly within Epic - First major health system EHR integration (not just standalone app) - This transitions OE from "physician chooses to open a separate app" to "AI suggestion accessible during clinical workflow" **This significantly INCREASES automation bias risk.** Research on in-context vs. external AI suggestions consistently shows higher adherence to in-context suggestions (reduced friction = increased trust). Embedding OE in Epic's workflow architecture makes the "bypass" behavior (ARISE "safety paradox") institutionally sanctioned — the shadow IT workaround becomes the official pathway. At 30M+ monthly consultations (mostly standalone), the Sutter EHR integration could add another ~12,000 physicians with in-context OE access at a different bias level. ### Core Finding 6: Health Canada Rejects Dr. Reddy's Semaglutide Application — May 2026 Canada Launch Is Off **MAJOR UPDATE TO SESSION 9:** The March 21 session projected Dr. Reddy's launching generic semaglutide in Canada by May 2026 (Canada patent expired January 2026). This is now confirmed incorrect: - October 2025: Health Canada issued a Notice of Non-Compliance (NoN) to Dr. Reddy's for its Abbreviated New Drug Submission for generic semaglutide injection - Health Canada subsequently REJECTED the application - Delay: 8-12 months from October 2025 = earliest new submission June-October 2026, approval timeline beyond that - Dr. Reddy's Canada launch is "on pause" — company engaging with regulators - Dr. Reddy's DID launch "Obeda" in India (confirmed March 21) - Canada remains the clearest data point for a major-market generic launch, but the timeline is now 2027 at earliest **Implication for KB:** The GLP-1 generic bifurcation narrative is accurate (India Day-1 confirmed), but the Canada data point will not arrive in May 2026. US gray market pressure building slower than projected. ### Core Finding 7: OBBBA Work Requirements — All 7 State Waivers Still Pending, Jan 2027 Mandatory As of January 23, 2026: - Mandatory implementation date: **January 1, 2027** (all states, for ACA expansion group, 80 hours/month) - 7 states with pending Section 1115 waivers (early implementation): Arizona, Arkansas, Iowa, Montana, Ohio, South Carolina, Utah — ALL STILL PENDING at CMS - Nebraska: implementing via state plan amendment (no waiver), ahead of schedule - Georgia: only state with implemented work requirements (July 2023), provides the only real-world precedent - Session 9 noted 22 AGs challenging Planned Parenthood defund; work requirements themselves NOT successfully litigated - HHS interim final rule still due June 2026 **What this means:** The coverage fragmentation mechanism (Session 8 finding) is not yet operational. The 10M uninsured projection runs to 2034; the 2026 implementation timeline means data won't emerge until 2027. The VBC continuous-enrollment disruption is structural but its observable impact is ~12-18 months away. ## Synthesis: The Reinforcement-Bias Amplification Mechanism The Session 9 concern is now substantially substantiated. Here is the full mechanism: 1. **LLMs have severe error rates** (22% of clinical cases in NOHARM) predominantly through **omissions** (76.6%) 2. **OE reinforces physician plans** (PMC study, 2025) — when physician plans contain omissions, OE confirmation makes those omissions more fixed 3. **LLMs have systematic sociodemographic biases** (Nature Medicine, 2025) — racial, income, and identity biases in clinical recommendations across all tested models 4. **OE reinforcing plans with sociodemographic bias** → amplifies those biases at 30M+/month scale 5. **Automation bias is robust** (NCT06963957) — even AI-trained physicians defer to erroneous AI, so the centaur model's "physician override" assumption is weaker than Belief 5 assumed 6. **EHR embedding amplifies** — Sutter Health OE-Epic integration increases in-context automation bias beyond standalone app use **The failure mode is now clearer:** Clinical AI systems at scale are most dangerous not when they are obviously wrong (physicians override), but when they **reinforce existing plans that have invisible errors** (omissions) or **systematic biases** (demographic). This is precisely what OE appears to do. The "reinforcement" is not safety; it's a bias-fixing mechanism. **HOWEVER — the counterpoint from NOHARM:** Best models outperform generalist physicians on safety (9.7%). If OE uses best-in-class models, it may be safer than generalist physicians even with its failure modes. The net safety question is: does OE's systematic reinforcement + bias + automation-bias effect exceed the benefits of 30M monthly evidence lookups? The evidence is insufficient to resolve this, but the failure modes are now clearly documented. ## Claim Candidates CLAIM CANDIDATE 1: "The dominant failure mode of clinical LLMs is harms of omission (76.6% of severe errors in the NOHARM study of 31 models), not commissions — meaning AI-assisted confirmation of existing clinical plans is dangerous because it reinforces the most common error type rather than surfacing missing actions" - Domain: health, secondary: ai-alignment - Confidence: likely (NOHARM is peer-reviewed, 100 real cases, 31 models — robust methodology; mechanism interpretation is inference) - Sources: arxiv 2512.01241 (NOHARM), Stanford Medicine news release January 2026 - KB connections: Extends Belief 5; connects to the OE "reinforces plans" PMC finding; challenges "centaur model catches errors" assumption CLAIM CANDIDATE 2: "LLMs systematically apply different clinical standards by sociodemographic category — LGBTQIA+ patients receive mental health referrals 6-7x more often than clinically indicated, and high-income patients receive significantly more advanced imaging — across both proprietary and open-source models (Nature Medicine, 2025, n=1.7M outputs)" - Domain: health, secondary: ai-alignment - Confidence: proven (1.7M outputs, 9 LLMs, P<0.001 for income imaging, published in Nature Medicine) - Sources: Nature Medicine doi:10.1038/s41591-025-03626-6 (PubMed 40195448) - KB connections: Extends Belief 5 (clinical AI safety risks); creates connection to Belief 2 (social determinants); challenges "AI reduces health disparities" narrative CLAIM CANDIDATE 3: "Erroneous LLM recommendations significantly degrade diagnostic accuracy even in AI-trained physicians — a randomized controlled trial (NCT06963957) found physicians with 20-hour AI-literacy training still showed automation bias when given deliberately flawed ChatGPT-4o recommendations, undermining the centaur model's assumption that physician judgment provides reliable error-catching" - Domain: health, secondary: ai-alignment - Confidence: likely (RCT design is sound; Pakistan physician sample may limit generalizability; effect is directionally consistent with automation bias literature) - Sources: medRxiv doi:10.1101/2025.08.23.25334280 (NCT06963957, August 2025) - KB connections: Directly challenges the "centaur model" assumption in Belief 5; connects to Theseus's alignment work on human oversight degradation CLAIM CANDIDATE 4: "OpenEvidence's embedding in Sutter Health's Epic EHR workflows (February 2026) transitions clinical AI from voluntary shadow-IT workaround to institutionally sanctioned in-workflow tool, increasing the automation bias risk by making AI suggestions accessible in-context during clinical decision-making" - Domain: health, secondary: ai-alignment - Confidence: experimental (EHR embedding → increased automation bias is inference from automation bias literature; empirical outcome for Sutter integration is unknown) - Sources: BusinessWire February 11, 2026; Healthcare IT News; Stanford-Harvard ARISE "safety paradox" framing - KB connections: Extends the OE scale-safety asymmetry (Sessions 8-9); new structural mechanism for how OE's risk profile changes with EHR integration CLAIM CANDIDATE 5: "Health Canada's rejection of Dr. Reddy's generic semaglutide application (October 2025, confirmed) delays Canada's first major-market generic semaglutide launch from May 2026 to at minimum mid-2027, leaving India as the only large-market precedent for post-patent-expiry pricing and access dynamics" - Domain: health - Confidence: proven (Health Canada NoN is regulatory fact; timeline inference is standard 8-12 month re-submission estimate) - Sources: Business Standard October 2025; The Globe and Mail; Business Standard March 2026 (India launch of Obeda) - KB connections: Updates Session 9 finding; recalibrates the GLP-1 global generic rollout timeline ## Disconfirmation Result: Belief 5 — EXPANDED, NOT FALSIFIED **Target:** The mechanism by which clinical AI creates safety risks. The March 21 "reinforces plans" finding seemed to WEAKEN the original automation-bias/deskilling mechanism. **Search result:** Belief 5 is NOT disconfirmed. The "reinforces plans" finding is WORSE than originally characterized: - NOHARM shows 76.6% of severe LLM errors are omissions — if OE reinforces plans containing omissions, the reinforcement amplifies the most common error type - Nature Medicine sociodemographic bias study shows LLMs systematically apply biased clinical standards — OE reinforcing biased plans at 30M/month scale amplifies demographic disparities - Automation bias RCT (NCT06963957) shows even AI-trained physicians defer to flawed AI — the centaur "physician judgment" safety assumption is weaker than stated - OE-Sutter EHR integration amplifies all of the above by making suggestions in-context **However — a genuine complication:** NOHARM shows best-in-class LLMs outperform generalist physicians on safety by 9.7%. If OE uses best-in-class models, some of its reinforcement may be reinforcing CORRECT plans that physicians would otherwise have deviated from harmfully. The net safety calculation is unknown. **Net Belief 5 assessment:** Belief 5 is strengthened in the FAILURE MODE CATALOGUE. The original framing (deskilling + automation bias) is incomplete. The fuller picture is: 1. Omission-reinforcement: OE confirms plans with missing actions → omissions become fixed 2. Demographic bias amplification: OE reinforces demographically biased plans at scale 3. Automation bias robustness: even trained physicians defer to AI 4. EHR embedding: in-context suggestions increase trust 5. Scale asymmetry: 30M+/month with zero prospective outcomes evidence, now embedding in Epic ## Belief Updates **Belief 5 (clinical AI safety):** **EXPANDED AND STRENGTHENED — new failure mode catalogue.** Original concern (automation bias + deskilling) is confirmed. New and more concerning mechanisms identified: - Omission-reinforcement (most important): OE confirming plans → fixing omissions; NOHARM shows omissions = 76.6% of all severe errors - Sociodemographic bias amplification (most insidious): OE built on models with systematic demographic biases reinforces those biases at scale - Automation bias robustness (most troubling): AI literacy training insufficient to protect against automation bias (NCT06963957) **Existing "AI clinical safety risks" KB claims:** Need to incorporate the NOHARM framework's omission/commission distinction. Current claims likely frame safety as "AI gives wrong advice" (commission). More accurate: "AI confirms incomplete advice" (omission). ## Follow-up Directions ### Active Threads (continue next session) - **NCT07199231 results (OE prospective trial):** Still underway (6-month data collection). This is the most important pending data. With the NOHARM + sociodemographic bias + automation bias RCT findings now available, the NCT07199231 results will be interpretable in this richer framework. Watch for preprint Q4 2026. - **Sutter Health OE-Epic integration outcomes:** The February 2026 launch is live. Watch for: (1) any Sutter Health quality/safety reporting that mentions OE; (2) any Epic App Orchard adoption data; (3) any adverse event reports from EHR-embedded AI. This is the first real-world data point for in-workflow OE use. - **OBBBA HHS interim final rule (June 2026):** Work requirements mandatory January 1, 2027. June 2026 rule determines implementation details. Nebraska's state plan amendment approach is the most important precedent to watch. - **Dr. Reddy's Canada regulatory resubmission:** Health Canada rejected the initial application. Company engaging with regulators. Watch for: (1) news of formal re-submission; (2) any Health Canada announcement on timeline. Canada remains the most important data point for major-market generic semaglutide access and pricing. - **NOHARM follow-up studies:** The multi-agent approach reduces harm (8.0% improvement). OE uses a single model architecture. Are multi-agent clinical AI designs entering the market? This could be the next-generation safety design that outperforms centaur. ### Dead Ends (don't re-run) - **Tweet feeds:** Sessions 6-10 all confirm dead. Don't check. - **Big Tech GLP-1 adherence platform search:** No native Apple/Google/Amazon GLP-1 program exists as of March 2026. Don't re-run until a product announcement signal emerges. - **May 2026 Canada semaglutide launch tracking:** Health Canada rejected the application. Don't expect Canada data in May 2026. Reset to mid-2027 at earliest. - **OpenEvidence "reinforces plans" as safety mitigation hypothesis:** This session's evidence resolves the Session 9 branching point. "Reinforcement" is NOT a safety mitigation — it's the most dangerous mechanism given the omission-dominant error structure. Direction B is confirmed: reinforcement-as-bias-amplification is the primary concern. ### Branching Points - **NOHARM "best models outperform physicians" finding:** - Direction A: OE using best-in-class models means it's net-safer than alternatives even with its failure modes — the reinforcement concern is smaller than NOHARM's absolute benefit - Direction B: OE's specific model choice and whether it's "best in class" is unknown — if it's not a top-performing model, the 22%+ error rate applies - **Recommendation: B.** OE has never disclosed its model architecture or safety benchmark performance. The NOHARM framework is the right lens to demand this disclosure from OE. The Sutter Health integration raises the stakes for this question — an EHR-embedded tool with unknown safety benchmarks now operates at health-system scale. - **Sociodemographic bias in OE specifically:** - Direction A: Search for any OE-specific bias evaluation (has anyone tested OE's recommendations across demographic groups?) - Direction B: Assume the Nature Medicine finding applies (found in all 9 tested models, both proprietary and open-source) and focus on what the Sutter Health partnership's safety oversight includes - **Recommendation: A first.** An OE-specific bias evaluation would be higher KB value than inference from the general finding. If no evaluation exists, that absence is itself a finding worth documenting.