From 00202805c8439ad94218d1e3466d5c2e6c6262e3 Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Sun, 22 Mar 2026 04:12:26 +0000 Subject: [PATCH] =?UTF-8?q?vida:=20research=20session=202026-03-22=20?= =?UTF-8?q?=E2=80=94=208=20sources=20archived?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Pentagon-Agent: Vida --- agents/vida/musings/research-2026-03-22.md | 244 ++++++++++++++++++ agents/vida/research-journal.md | 27 ++ ...6-03-22-arise-state-of-clinical-ai-2026.md | 58 +++++ ...tomation-bias-rct-ai-trained-physicians.md | 57 ++++ ...-bias-clinical-llm-npj-digital-medicine.md | 62 +++++ ...th-canada-rejects-dr-reddys-semaglutide.md | 53 ++++ ...ture-medicine-llm-sociodemographic-bias.md | 56 ++++ ...-work-requirements-state-implementation.md | 62 +++++ ...evidence-sutter-health-epic-integration.md | 58 +++++ ...ford-harvard-noharm-clinical-llm-safety.md | 51 ++++ 10 files changed, 728 insertions(+) create mode 100644 agents/vida/musings/research-2026-03-22.md create mode 100644 inbox/queue/2026-03-22-arise-state-of-clinical-ai-2026.md create mode 100644 inbox/queue/2026-03-22-automation-bias-rct-ai-trained-physicians.md create mode 100644 inbox/queue/2026-03-22-cognitive-bias-clinical-llm-npj-digital-medicine.md create mode 100644 inbox/queue/2026-03-22-health-canada-rejects-dr-reddys-semaglutide.md create mode 100644 inbox/queue/2026-03-22-nature-medicine-llm-sociodemographic-bias.md create mode 100644 inbox/queue/2026-03-22-obbba-medicaid-work-requirements-state-implementation.md create mode 100644 inbox/queue/2026-03-22-openevidence-sutter-health-epic-integration.md create mode 100644 inbox/queue/2026-03-22-stanford-harvard-noharm-clinical-llm-safety.md diff --git a/agents/vida/musings/research-2026-03-22.md b/agents/vida/musings/research-2026-03-22.md new file mode 100644 index 00000000..a2c2aac8 --- /dev/null +++ b/agents/vida/musings/research-2026-03-22.md @@ -0,0 +1,244 @@ +--- +status: seed +type: musing +stage: developing +created: 2026-03-22 +last_updated: 2026-03-22 +tags: [clinical-ai-safety, openevidence, automation-bias, sociodemographic-bias, noharm, llm-errors, sutter-health, semaglutide-canada, health-canada-rejection, obbba-work-requirements, belief-5-disconfirmation] +--- + +# Research Session: Clinical AI Safety Mechanism — Reinforcement or Bias Amplification? + +## Research Question + +**Is the clinical AI safety concern for tools like OpenEvidence primarily about automation bias/de-skilling (changing wrong decisions), or about systematic bias amplification (reinforcing existing physician biases and plan omissions at population scale)? What does the 2025-2026 evidence base on LLM systematic bias and clinical safety say about the predominant failure mode?** + +## Why This Question + +**Session 9 (March 21) opened Direction B as the highest KB value thread:** The "OE reinforces existing plans" PMC finding (not changing decisions) appeared to WEAKEN the deskilling/automation-bias mechanism originally in Belief 5. But I flagged the alternative: if OE reinforces plans that already contain systematic biases or omissions, the safety concern shifts to population-scale amplification of existing errors. Direction B is more dangerous because it's invisible — physicians remain "competent" but systematically biased and overconfident in reinforced plans. + +**Keystone belief disconfirmation target — Session 10 (Belief 5):** + +The claim: "Clinical AI augments physicians but creates novel safety risks requiring centaur design." 
Session 9 complicated this by suggesting OE doesn't change decisions, weakening the known automation-bias mechanism.

**What would disconfirm Belief 5's safety concern:**
- Evidence that LLM clinical recommendations have minimal systematic bias (unbiased reinforcement = net positive)
- Evidence that OE-type tools surface omissions and concerns that physicians miss (additive rather than confirmatory)
- Evidence that physicians actively override or critically evaluate AI recommendations (automation bias minimal in practice)

**What would strengthen Direction B (reinforcement-as-amplification):**
- Evidence that LLMs have systematic sociodemographic biases in clinical recommendations (if OE reinforces these, it amplifies them)
- Evidence that most LLM errors are omissions rather than commissions (OE confirming plans = confirming plans with omissions)
- Evidence that physicians develop automation bias toward AI suggestions even when trained otherwise

## What I Found

### Core Finding 1: NOHARM Study — LLMs Make Severe Errors in 22% of Clinical Cases, 76.6% Are Omissions

The Stanford/Harvard NOHARM study ("First, Do NOHARM: Towards Clinically Safe Large Language Models," arxiv 2512.01241, findings released January 2, 2026) is the most rigorous clinical AI safety evaluation to date:

- 31 LLMs tested on 100 real primary care consultation cases across 10 specialties
- Cases drawn from 16,399 real electronic consultations at Stanford Health Care
- 12,747 expert annotations for 4,249 clinical management options
- **Severe harm in up to 22.2% of cases (95% CI 21.6-22.8%)**
- **Harms of OMISSION account for 76.6% of all errors** — not commissions (wrong action), but missing necessary actions
- Best models (Gemini 2.5 Flash, LiSA 1.0): 11.8-14.6 severe errors per 100 cases
- Worst models (o4 mini, GPT-4o mini): 39.9-40.1 severe errors per 100 cases
- Safety performance ONLY MODERATELY correlated with AI benchmarks (r = 0.61-0.64) — USMLE-style benchmark scores don't reliably predict clinical safety
- HOWEVER: Best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%)
- Multi-agent approach reduces harm vs. solo model (mean difference 8.0%, 95% CI 4.0-12.1%)

**Critical connection to OE "reinforces plans" finding:** The dominant error type (76.6% omissions) DIRECTLY EXPLAINS why "reinforcement" is dangerous. If OE confirms a physician's plan that has an omission (the most common error), OE's confirmation makes the physician MORE confident in an incomplete plan. This is not "OE causes wrong actions" — it's "OE prevents the physician from recognizing what they missed." At 30M+ monthly consultations, this operates at population scale.
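To make the scale concrete, a back-of-envelope sketch in Python. Only the 30M+ monthly consultation volume and the NOHARM omission share are sourced above; the plan-omission and reinforcement rates are hypothetical placeholders, so the output illustrates the shape of the mechanism, not a measured harm estimate.

```python
# Back-of-envelope sketch of the omission-reinforcement mechanism at scale.
# HYPOTHETICAL placeholders are marked; only the consultation volume and the
# NOHARM omission share are taken from the findings above.

monthly_consultations = 30_000_000  # OE's reported monthly volume
plan_omission_rate = 0.10           # HYPOTHETICAL: share of physician plans with a clinically relevant omission
reinforcement_rate = 0.80           # HYPOTHETICAL: share of those plans the tool confirms rather than challenges
noharm_omission_share = 0.766       # NOHARM: omissions as a share of severe LLM errors

reinforced = monthly_consultations * plan_omission_rate * reinforcement_rate
print(f"Omission-containing plans reinforced per month (placeholder inputs): {reinforced:,.0f}")
print(f"NOHARM omission share of severe errors: {noharm_omission_share:.1%}")
# With these placeholder inputs: 2,400,000 consultations/month in which
# confirmation hardens an incomplete plan instead of surfacing the gap.
```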
+ +### Core Finding 2: Nature Medicine Sociodemographic Bias Study — Systematic Demographic Bias in All Clinical LLMs + +Published in Nature Medicine (2025, doi: 10.1038/s41591-025-03626-6), PubMed 40195448: + +- 9 LLMs evaluated, 1.7 million model-generated outputs +- 1,000 ED cases (500 real, 500 synthetic) presented in 32 sociodemographic variations +- Clinical details held constant — only demographic labels changed + +**Findings:** +- Black, unhoused, LGBTQIA+ patients: more frequently directed to urgent care, invasive interventions, mental health evaluations +- LGBTQIA+ subgroups: mental health assessments recommended **6-7x more often than clinically indicated** +- High-income patients: significantly more advanced imaging (CT/MRI, P < 0.001) +- Low/middle-income patients: limited to basic or no further testing +- Bias found in BOTH proprietary AND open-source models + +**The "not supported by clinical reasoning or guidelines" qualifier is key:** These biases are not acceptable clinical variation — they are model-driven artifacts. They would propagate if a tool like OE "reinforces" physician plans in these demographic contexts. + +**Combined with NOHARM:** If OE is built on models with systematic sociodemographic biases, AND OE "reinforces" physician plans, AND physician plans are subject to the same demographic biases (physicians also show these patterns in the literature), then OE amplifies demographic bias at population scale rather than correcting it. + +### Core Finding 3: Automation Bias RCT — Even AI-Trained Physicians Defer to Erroneous AI + +Registered clinical trial (NCT06963957), published medRxiv August 26, 2025: + +- Pakistan RCT (June 20-August 15, 2025), physicians from multiple institutions +- All participants had completed 20-hour AI-literacy training (critical evaluation of AI output) +- Randomized 1:1: control arm received correct ChatGPT-4o recommendations; treatment arm received recommendations with deliberate errors in 3 of 6 vignettes +- **Result: erroneous LLM recommendations significantly degraded diagnostic performance even in AI-trained physicians** +- "Voluntary deference to flawed AI output highlights critical patient safety risk" + +**This directly challenges the "centaur design will solve it" assumption in Belief 5.** If 20 hours of AI literacy training is insufficient to protect physicians from automation bias, the centaur model's "physician for judgment" component is more vulnerable than assumed. The physicians most likely to use OE are exactly those most likely to trust it. + +Related: JAMA Network Open "LLM Influence on Diagnostic Reasoning" randomized clinical trial (June 2025) — same pattern emerging across multiple experimental designs. 
### Core Finding 4: Stanford-Harvard State of Clinical AI 2026 (ARISE Network)

The ARISE network (Stanford-Harvard) released the "State of Clinical AI 2026" in January/February 2026:

- Explicitly distinguishes "benchmark performance" from "real-world clinical performance" — the gap is large
- LLMs break down under "uncertainty, incomplete information, or multi-step workflows" — everyday clinical conditions
- **"Safety paradox":** Clinicians use consumer-facing tools like OE to bypass slow institutional IT governance, prioritizing speed over compliance/oversight
- Evaluation frameworks must "focus on outcomes rather than engagement"
- OE specifically cited as a "consumer-facing medical search engine" used to "bypass slow internal IT systems"

The "safety paradox" is a new framing: the features that make OE attractive (speed, external access, consumer-grade UX) are EXACTLY the features that create governance gaps. OE adoption is driven by workaround behavior, not institutional validation.

### Core Finding 5: OpenEvidence + Sutter Health Epic EHR Integration (February 11, 2026)

Announced February 11, 2026: OE is now embedded within Epic EHR workflows at Sutter Health (one of California's largest health systems, ~12,000 physicians):

- Natural-language search for guidelines, studies, and clinical evidence — directly within Epic
- First major health system EHR integration (not just a standalone app)
- This transitions OE from "physician chooses to open a separate app" to "AI suggestion accessible during clinical workflow"

**This significantly INCREASES automation bias risk.** Research on in-context vs. external AI suggestions consistently shows higher adherence to in-context suggestions (reduced friction = increased trust). Embedding OE in Epic's workflow architecture makes the "bypass" behavior (ARISE "safety paradox") institutionally sanctioned — the shadow IT workaround becomes the official pathway.

OE's 30M+ monthly consultations are mostly standalone; the Sutter EHR integration adds ~12,000 physicians whose OE access is in-context, where adherence to suggestions (and therefore automation-bias exposure) is higher.

### Core Finding 6: Health Canada Rejects Dr. Reddy's Semaglutide Application — May 2026 Canada Launch Is Off

**MAJOR UPDATE TO SESSION 9:** The March 21 session projected that Dr. Reddy's would launch generic semaglutide in Canada by May 2026 (Canada patent expired January 2026). That projection is now confirmed to be incorrect:

- October 2025: Health Canada issued a Notice of Non-Compliance (NoN) to Dr. Reddy's for its Abbreviated New Drug Submission for generic semaglutide injection
- Health Canada subsequently REJECTED the application
- Delay: 8-12 months from October 2025 = earliest new submission June-October 2026, with the approval timeline extending beyond that
- Dr. Reddy's Canada launch is "on pause" — company engaging with regulators
- Dr. Reddy's DID launch "Obeda" in India (confirmed March 21)
- Canada remains the clearest data point for a major-market generic launch, but the timeline is now 2027 at earliest

**Implication for KB:** The GLP-1 generic bifurcation narrative is accurate (India Day-1 confirmed), but the Canada data point will not arrive in May 2026. US gray-market pressure is building more slowly than projected.
### Core Finding 7: OBBBA Work Requirements — All 7 State Waivers Still Pending, Jan 2027 Mandatory

As of January 23, 2026:
- Mandatory implementation date: **January 1, 2027** (all states, for the ACA expansion group, 80 hours/month)
- 7 states with pending Section 1115 waivers (early implementation): Arizona, Arkansas, Iowa, Montana, Ohio, South Carolina, Utah — ALL STILL PENDING at CMS
- Nebraska: implementing via state plan amendment (no waiver), ahead of schedule
- Georgia: only state with implemented work requirements (July 2023), provides the only real-world precedent
- Session 9 noted 22 AGs challenging the Planned Parenthood defund; the work requirements themselves have NOT been successfully challenged in court
- HHS interim final rule still due June 2026

**What this means:** The coverage fragmentation mechanism (Session 8 finding) is not yet operational. The 10M uninsured projection runs to 2034; the January 2027 mandatory date means data won't emerge until 2027. The VBC continuous-enrollment disruption is structural, but its observable impact is ~12-18 months away.

## Synthesis: The Reinforcement-Bias Amplification Mechanism

The Session 9 concern is now substantially confirmed. Here is the full mechanism:

1. **LLMs have severe error rates** (22% of clinical cases in NOHARM) predominantly through **omissions** (76.6%)
2. **OE reinforces physician plans** (PMC study, 2025) — when physician plans contain omissions, OE confirmation makes those omissions more entrenched
3. **LLMs have systematic sociodemographic biases** (Nature Medicine, 2025) — racial, income, and identity biases in clinical recommendations across all tested models
4. **OE reinforcing plans with sociodemographic bias** → amplifies those biases at 30M+/month scale
5. **Automation bias is robust** (NCT06963957) — even AI-trained physicians defer to erroneous AI, so the centaur model's "physician override" assumption is weaker than Belief 5 assumed
6. **EHR embedding amplifies** — the Sutter Health OE-Epic integration increases in-context automation bias beyond standalone app use

**The failure mode is now clearer:** Clinical AI systems at scale are most dangerous not when they are obviously wrong (physicians override), but when they **reinforce existing plans that have invisible errors** (omissions) or **systematic biases** (demographic). This is precisely what OE appears to do. The "reinforcement" is not a safety property; it is a mechanism that fixes errors and biases in place.

**HOWEVER — the counterpoint from NOHARM:** Best models outperform generalist physicians on safety (9.7%). If OE uses best-in-class models, it may be safer than generalist physicians even with its failure modes. The net safety question is: does OE's systematic reinforcement + bias + automation-bias effect exceed the benefits of 30M monthly evidence lookups? The evidence is insufficient to resolve this, but the failure modes are now clearly documented.
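The unresolved net-safety question can be written down as a toy expected-harm comparison. A minimal sketch under hypothetical behavioral parameters (only the 9.7% best-model advantage comes from NOHARM): it shows the structure of the calculation the evidence cannot yet populate, not an answer.

```python
# Toy expected-harm comparison: physician alone vs. physician + AI under two
# behavioral regimes. Only the 9.7% NOHARM advantage is sourced; the other
# rates are HYPOTHETICAL placeholders.

physician_error_rate = 0.25                           # HYPOTHETICAL per-case severe-error rate
best_model_error_rate = physician_error_rate - 0.097  # NOHARM: best models ~9.7% safer than generalists

# Regime 1 (additive): the AI surfaces a share of the physician's errors.
catch_rate = 0.30                                     # HYPOTHETICAL
additive = physician_error_rate * (1 - catch_rate)

# Regime 2 (reinforcing): confirmation suppresses some self-correction, so a
# fraction of would-be catches are lost to increased confidence.
suppression_rate = 0.10                               # HYPOTHETICAL
reinforcing = physician_error_rate * (1 + suppression_rate)

print(f"physician alone: {physician_error_rate:.3f}")
print(f"best model alone (NOHARM advantage): {best_model_error_rate:.3f}")
print(f"physician + additive AI: {additive:.3f}")
print(f"physician + reinforcing AI: {reinforcing:.3f}")
# Which regime dominates in practice is exactly the evidence gap: the same
# tool is net-positive under regime 1 and net-negative under regime 2.
```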
+ +## Claim Candidates + +CLAIM CANDIDATE 1: "The dominant failure mode of clinical LLMs is harms of omission (76.6% of severe errors in the NOHARM study of 31 models), not commissions — meaning AI-assisted confirmation of existing clinical plans is dangerous because it reinforces the most common error type rather than surfacing missing actions" +- Domain: health, secondary: ai-alignment +- Confidence: likely (NOHARM is peer-reviewed, 100 real cases, 31 models — robust methodology; mechanism interpretation is inference) +- Sources: arxiv 2512.01241 (NOHARM), Stanford Medicine news release January 2026 +- KB connections: Extends Belief 5; connects to the OE "reinforces plans" PMC finding; challenges "centaur model catches errors" assumption + +CLAIM CANDIDATE 2: "LLMs systematically apply different clinical standards by sociodemographic category — LGBTQIA+ patients receive mental health referrals 6-7x more often than clinically indicated, and high-income patients receive significantly more advanced imaging — across both proprietary and open-source models (Nature Medicine, 2025, n=1.7M outputs)" +- Domain: health, secondary: ai-alignment +- Confidence: proven (1.7M outputs, 9 LLMs, P<0.001 for income imaging, published in Nature Medicine) +- Sources: Nature Medicine doi:10.1038/s41591-025-03626-6 (PubMed 40195448) +- KB connections: Extends Belief 5 (clinical AI safety risks); creates connection to Belief 2 (social determinants); challenges "AI reduces health disparities" narrative + +CLAIM CANDIDATE 3: "Erroneous LLM recommendations significantly degrade diagnostic accuracy even in AI-trained physicians — a randomized controlled trial (NCT06963957) found physicians with 20-hour AI-literacy training still showed automation bias when given deliberately flawed ChatGPT-4o recommendations, undermining the centaur model's assumption that physician judgment provides reliable error-catching" +- Domain: health, secondary: ai-alignment +- Confidence: likely (RCT design is sound; Pakistan physician sample may limit generalizability; effect is directionally consistent with automation bias literature) +- Sources: medRxiv doi:10.1101/2025.08.23.25334280 (NCT06963957, August 2025) +- KB connections: Directly challenges the "centaur model" assumption in Belief 5; connects to Theseus's alignment work on human oversight degradation + +CLAIM CANDIDATE 4: "OpenEvidence's embedding in Sutter Health's Epic EHR workflows (February 2026) transitions clinical AI from voluntary shadow-IT workaround to institutionally sanctioned in-workflow tool, increasing the automation bias risk by making AI suggestions accessible in-context during clinical decision-making" +- Domain: health, secondary: ai-alignment +- Confidence: experimental (EHR embedding → increased automation bias is inference from automation bias literature; empirical outcome for Sutter integration is unknown) +- Sources: BusinessWire February 11, 2026; Healthcare IT News; Stanford-Harvard ARISE "safety paradox" framing +- KB connections: Extends the OE scale-safety asymmetry (Sessions 8-9); new structural mechanism for how OE's risk profile changes with EHR integration + +CLAIM CANDIDATE 5: "Health Canada's rejection of Dr. 
Reddy's generic semaglutide application (October 2025, confirmed) delays Canada's first major-market generic semaglutide launch from May 2026 to mid-2027 at the earliest, leaving India as the only large-market precedent for post-patent-expiry pricing and access dynamics"
- Domain: health
- Confidence: proven for the rejection (the Health Canada NoN is regulatory fact); the timeline is an inference from the standard 8-12 month re-submission estimate
- Sources: Business Standard October 2025; The Globe and Mail; Business Standard March 2026 (India launch of Obeda)
- KB connections: Updates Session 9 finding; recalibrates the GLP-1 global generic rollout timeline

## Disconfirmation Result: Belief 5 — EXPANDED, NOT FALSIFIED

**Target:** The mechanism by which clinical AI creates safety risks. The March 21 "reinforces plans" finding seemed to WEAKEN the original automation-bias/deskilling mechanism.

**Search result:** Belief 5 is NOT disconfirmed. The "reinforces plans" finding is WORSE than originally characterized:
- NOHARM shows 76.6% of severe LLM errors are omissions — if OE reinforces plans containing omissions, the reinforcement amplifies the most common error type
- The Nature Medicine sociodemographic bias study shows LLMs systematically apply biased clinical standards — OE reinforcing biased plans at 30M/month scale amplifies demographic disparities
- The automation bias RCT (NCT06963957) shows even AI-trained physicians defer to flawed AI — the centaur "physician judgment" safety assumption is weaker than stated
- The OE-Sutter EHR integration amplifies all of the above by making suggestions in-context

**However — a genuine complication:** NOHARM shows best-in-class LLMs outperform generalist physicians on safety by 9.7%. If OE uses best-in-class models, some of its reinforcement may be reinforcing CORRECT plans that physicians might otherwise have harmfully deviated from. The net safety calculation is unknown.

**Net Belief 5 assessment:** Belief 5 is strengthened, with an EXPANDED FAILURE MODE CATALOGUE. The original framing (deskilling + automation bias) is incomplete. The fuller picture is:
1. Omission-reinforcement: OE confirms plans with missing actions → omissions become fixed
2. Demographic bias amplification: OE reinforces demographically biased plans at scale
3. Automation bias robustness: even trained physicians defer to AI
4. EHR embedding: in-context suggestions increase trust
5. Scale asymmetry: 30M+/month with zero prospective outcomes evidence, now embedding in Epic

## Belief Updates

**Belief 5 (clinical AI safety):** **EXPANDED AND STRENGTHENED — new failure mode catalogue.** The original concern (automation bias + deskilling) is confirmed. New and more concerning mechanisms identified:
- Omission-reinforcement (most important): OE confirming plans → fixing omissions; NOHARM shows omissions = 76.6% of all severe errors
- Sociodemographic bias amplification (most insidious): OE built on models with systematic demographic biases reinforces those biases at scale
- Automation bias robustness (most troubling): AI-literacy training is insufficient to protect against automation bias (NCT06963957)

**Existing "AI clinical safety risks" KB claims:** Need to incorporate the NOHARM framework's omission/commission distinction. Current claims likely frame safety as "AI gives wrong advice" (commission). More accurate: "AI confirms incomplete advice" (omission).
+ +## Follow-up Directions + +### Active Threads (continue next session) + +- **NCT07199231 results (OE prospective trial):** Still underway (6-month data collection). This is the most important pending data. With the NOHARM + sociodemographic bias + automation bias RCT findings now available, the NCT07199231 results will be interpretable in this richer framework. Watch for preprint Q4 2026. + +- **Sutter Health OE-Epic integration outcomes:** The February 2026 launch is live. Watch for: (1) any Sutter Health quality/safety reporting that mentions OE; (2) any Epic App Orchard adoption data; (3) any adverse event reports from EHR-embedded AI. This is the first real-world data point for in-workflow OE use. + +- **OBBBA HHS interim final rule (June 2026):** Work requirements mandatory January 1, 2027. June 2026 rule determines implementation details. Nebraska's state plan amendment approach is the most important precedent to watch. + +- **Dr. Reddy's Canada regulatory resubmission:** Health Canada rejected the initial application. Company engaging with regulators. Watch for: (1) news of formal re-submission; (2) any Health Canada announcement on timeline. Canada remains the most important data point for major-market generic semaglutide access and pricing. + +- **NOHARM follow-up studies:** The multi-agent approach reduces harm (8.0% improvement). OE uses a single model architecture. Are multi-agent clinical AI designs entering the market? This could be the next-generation safety design that outperforms centaur. + +### Dead Ends (don't re-run) + +- **Tweet feeds:** Sessions 6-10 all confirm dead. Don't check. + +- **Big Tech GLP-1 adherence platform search:** No native Apple/Google/Amazon GLP-1 program exists as of March 2026. Don't re-run until a product announcement signal emerges. + +- **May 2026 Canada semaglutide launch tracking:** Health Canada rejected the application. Don't expect Canada data in May 2026. Reset to mid-2027 at earliest. + +- **OpenEvidence "reinforces plans" as safety mitigation hypothesis:** This session's evidence resolves the Session 9 branching point. "Reinforcement" is NOT a safety mitigation — it's the most dangerous mechanism given the omission-dominant error structure. Direction B is confirmed: reinforcement-as-bias-amplification is the primary concern. + +### Branching Points + +- **NOHARM "best models outperform physicians" finding:** + - Direction A: OE using best-in-class models means it's net-safer than alternatives even with its failure modes — the reinforcement concern is smaller than NOHARM's absolute benefit + - Direction B: OE's specific model choice and whether it's "best in class" is unknown — if it's not a top-performing model, the 22%+ error rate applies + - **Recommendation: B.** OE has never disclosed its model architecture or safety benchmark performance. The NOHARM framework is the right lens to demand this disclosure from OE. The Sutter Health integration raises the stakes for this question — an EHR-embedded tool with unknown safety benchmarks now operates at health-system scale. + +- **Sociodemographic bias in OE specifically:** + - Direction A: Search for any OE-specific bias evaluation (has anyone tested OE's recommendations across demographic groups?) 
+ - Direction B: Assume the Nature Medicine finding applies (found in all 9 tested models, both proprietary and open-source) and focus on what the Sutter Health partnership's safety oversight includes + - **Recommendation: A first.** An OE-specific bias evaluation would be higher KB value than inference from the general finding. If no evaluation exists, that absence is itself a finding worth documenting. diff --git a/agents/vida/research-journal.md b/agents/vida/research-journal.md index 591fa419..4b42b24a 100644 --- a/agents/vida/research-journal.md +++ b/agents/vida/research-journal.md @@ -1,5 +1,32 @@ # Vida Research Journal +## Session 2026-03-22 — Clinical AI Safety Mechanism: Reinforcement as Bias Amplification + +**Question:** Is the clinical AI safety concern for tools like OpenEvidence primarily about automation bias/de-skilling (changing wrong decisions), or about systematic bias amplification (reinforcing existing physician biases and plan omissions at population scale)? + +**Belief targeted:** Belief 5 — "Clinical AI augments physicians but creates novel safety risks requiring centaur design." Session 9's "OE reinforces plans" finding (PMC) appeared to WEAKEN the original deskilling/automation-bias mechanism. Session 10 searched for whether this "reinforcement" is actually more dangerous through a different mechanism: amplifying biases and omissions at scale. + +**Disconfirmation result:** Belief 5 NOT disconfirmed — the "reinforcement" mechanism is WORSE, not better, than the original framing. Four converging lines of evidence: +1. **NOHARM (Stanford/Harvard, January 2026):** 22% severe errors across 31 LLMs; 76.6% of errors are OMISSIONS (missing necessary actions). If OE confirms a plan with an omission, the omission becomes fixed. +2. **Nature Medicine sociodemographic bias study (2025, 1.7M outputs):** All tested LLMs show systematic demographic bias (LGBTQIA+ mental health referrals 6-7x clinically indicated; income-driven imaging disparities, P<0.001). Bias found in both proprietary and open-source models. +3. **Automation bias RCT (NCT06963957, medRxiv August 2025):** Even physicians with 20-hour AI-literacy training deferred to erroneous AI recommendations. The centaur model's "physician judgment catches errors" assumption is empirically weaker than stated. +4. **OE-Sutter EHR integration (February 2026):** OE embedded in Epic workflows at Sutter Health (~12,000 physicians) with no mention of pre-deployment safety evaluation. In-context embedding increases automation bias beyond standalone app use. + +**Key finding:** The "reinforcement-bias amplification" mechanism: (1) OE confirms physician plans; (2) confirmed plans often contain omissions (76.6% of LLM severe errors); (3) LLMs systematically apply biased clinical standards by sociodemographic group; (4) OE's confirmation makes physicians MORE confident in plans that are omission-containing and demographically biased; (5) at 30M+/month, this propagates at population scale. The failure mode is not "OE causes wrong actions" — it is "OE prevents physicians from recognizing what's missing and amplifies the biases already in their plans." + +HOWEVER — genuine complication: NOHARM shows best-in-class LLMs outperform generalist physicians on safety by 9.7%. OE using best-in-class models might be safer than physician baseline even with these failure modes. The net calculation remains unknown. + +**CORRECTION from Session 9:** Health Canada REJECTED Dr. Reddy's semaglutide application (October 2025). 
Canada launch is "on pause" — 2027 at earliest. May 2026 Canada data point is no longer available. India (Obeda) remains the only confirmed major-market generic launch. + +**Pattern update:** Session 10 resolves the Session 9 branching point (Direction A vs B for OE safety mechanism). Direction B is confirmed: "reinforcement-as-bias-amplification" is the primary safety concern, not the original automation-bias/deskilling framing. The safety literature (NOHARM, Nature Medicine, NCT06963957) converged in 2025-2026 to define a more concerning failure mode than originally framed in Belief 5. The cross-session meta-pattern (theory-practice gap) appears here too: the centaur design (Belief 5's proposed solution) is now empirically challenged by evidence that physician oversight is insufficient to catch AI errors even with training. + +**Confidence shift:** +- Belief 5 (clinical AI safety): **EXPANDED — new failure mode catalogue.** Original deskilling + automation bias concern confirmed; three new mechanisms added: omission-reinforcement (NOHARM), demographic bias amplification (Nature Medicine), automation bias robustness (NCT06963957). The centaur design assumption weakened but not abandoned — multi-agent approaches (NOHARM: 8% harm reduction) suggest design solutions exist. +- GLP-1 Canada timeline: **CORRECTED** — 2027 at earliest; May 2026 projection from Session 9 was wrong (Health Canada rejection) +- OBBBA work requirements: **TIMELINE CLARIFIED** — mandatory January 1, 2027; observable effects 2027+; provider tax freeze is the already-in-effect mechanism + +--- + ## Session 2026-03-21 — India Semaglutide Day-1 Generics and the Bifurcating GLP-1 Landscape **Question:** Now that semaglutide's India patent expired March 20, 2026 and generics launched March 21 (today), what are actual Day-1 market prices — and does Indian generic competition create importation arbitrage pathways into the US before the 2031-2033 patent wall, accelerating the 'inflationary through 2035' KB claim's obsolescence? Secondary: what does the tirzepatide/semaglutide bifurcation mean for the GLP-1 landscape? diff --git a/inbox/queue/2026-03-22-arise-state-of-clinical-ai-2026.md b/inbox/queue/2026-03-22-arise-state-of-clinical-ai-2026.md new file mode 100644 index 00000000..621138fc --- /dev/null +++ b/inbox/queue/2026-03-22-arise-state-of-clinical-ai-2026.md @@ -0,0 +1,58 @@ +--- +type: source +title: "State of Clinical AI Report 2026 (ARISE Network, Stanford-Harvard)" +author: "ARISE Network — Peter Brodeur MD, Ethan Goh MD, Adam Rodman MD, Jonathan Chen MD PhD" +url: https://arise-ai.org/report +date: 2026-01-01 +domain: health +secondary_domains: [ai-alignment] +format: report +status: unprocessed +priority: high +tags: [clinical-ai, state-of-ai, stanford, harvard, arise, openevidence, safety-paradox, outcomes-evidence, real-world-performance] +--- + +## Content + +The State of Clinical AI (2026) was released in January 2026 by the ARISE network, a Stanford-Harvard research collaboration. The inaugural report synthesizes evidence on clinical AI performance in real-world settings vs. controlled benchmarks. + +**Key findings:** + +**Benchmark vs. 
real-world gap:** +- LLMs demonstrate strong performance on diagnostic benchmarks and structured clinical cases +- Real-world performance "breaks down when systems must manage uncertainty, incomplete information, or multi-step workflows" — which describes everyday clinical care +- "Real-world care remains uneven" as an evidence base + +**The "Safety Paradox" (novel framing):** +- Clinicians turn to "nimble, consumer-facing medical search engines" (specifically citing OpenEvidence) to check drug interactions and summarize patient histories, "often bypassing slow internal IT systems" +- This represents a **safety paradox**: clinicians prioritize speed over compliance because institutional AI tools are too slow for clinical workflows +- OE adoption is explicitly characterized as **shadow-IT workaround behavior** that has become normalized + +**Evaluation framework:** +- The report argues current evaluation focuses on "engagement rather than outcomes" +- Calls for "clearer evidence, stronger escalation pathways, and evaluation frameworks that focus on outcomes rather than engagement alone" + +**OpenEvidence specifically named** as a case study of consumer-facing medical AI being used to bypass institutional oversight. + +Additional coverage: Stanford Department of Medicine news release, BABL AI, Harvard Science Review ("Beyond the Hype: The First Real Audit of Clinical AI," February 2026), Stanford HAI. + +## Agent Notes +**Why this matters:** The ARISE report is the first systematic, peer-network-authored overview of clinical AI's real-world state. Its framing of OE as "shadow IT" is significant — it recharacterizes OE's rapid adoption not as a sign of clinical value, but as clinicians working around institutional barriers. This frames the OE-Sutter Epic integration as moving from "shadow IT" to "officially sanctioned shadow IT" — the speed that made OE attractive is now institutionally embedded without resolving the governance gap. + +**What surprised me:** The explicit naming of OpenEvidence as a case study in the safety paradox. This is the first time a Stanford-affiliated academic review has characterized OE adoption as a workaround behavior rather than evidence of clinical value. At $12B valuation and 30M+ consultations/month, this framing matters for how OE's safety profile is evaluated. + +**What I expected but didn't find:** Specific outcome data for any clinical AI tool. The report explicitly identifies this as the field's core gap — the absence of outcomes data is the finding, not an absence of coverage. + +**KB connections:** +- Directly extends Session 9 finding on the valuation-evidence asymmetry (OE at $12B, one retrospective 5-case study) +- The "safety paradox" framing provides vocabulary for why OE's governance gap is structural, not accidental +- Connects to the Sutter Health EHR integration (February 2026) — embedding OE in Epic formally addresses the speed problem while potentially entrenching the governance gap + +**Extraction hints:** Extract the "safety paradox" framing as a named mechanism: clinicians bypassing institutional AI governance to use consumer-facing tools because institutional tools are too slow. This is generalizable beyond OE. Secondary: extract the benchmark-vs-real-world gap finding as it applies to clinical AI at scale. + +**Context:** The ARISE network is the most credible academic voice on clinical AI evaluation practices. 
The report's release in January 2026 — coinciding with the NOHARM study findings — represents a coordinated moment of academic accountability for a rapidly scaling industry. The Harvard Science Review calling it "the first real audit" signals its significance in the field. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: "medical LLM benchmarks don't translate to clinical impact" (existing KB claim) +WHY ARCHIVED: Provides the first systematic framework for understanding clinical AI real-world performance gaps, introduces the "safety paradox" framing for consumer AI workaround behavior +EXTRACTION HINT: The "safety paradox" is a novel mechanism claim — extract it separately from the benchmark-gap finding. Both have evidence (OE adoption behavior, real-world performance breakdown) and are specific enough to be arguable. diff --git a/inbox/queue/2026-03-22-automation-bias-rct-ai-trained-physicians.md b/inbox/queue/2026-03-22-automation-bias-rct-ai-trained-physicians.md new file mode 100644 index 00000000..3f96fa84 --- /dev/null +++ b/inbox/queue/2026-03-22-automation-bias-rct-ai-trained-physicians.md @@ -0,0 +1,57 @@ +--- +type: source +title: "Automation Bias in LLM-Assisted Diagnostic Reasoning Among AI-Trained Physicians (RCT, medRxiv August 2025)" +author: "Multi-institution research team (Pakistan Medical and Dental Council physician cohort)" +url: https://www.medrxiv.org/content/10.1101/2025.08.23.25334280v1 +date: 2025-08-26 +domain: health +secondary_domains: [ai-alignment] +format: research paper +status: unprocessed +priority: high +tags: [automation-bias, clinical-ai-safety, physician-rct, llm-diagnostic, centaur-model, ai-literacy, chatgpt, randomized-trial] +--- + +## Content + +Published medRxiv August 26, 2025. Registered as NCT06963957 ("Automation Bias in Physician-LLM Diagnostic Reasoning"). + +**Study design:** +- Single-blind randomized clinical trial +- Timeframe: June 20 to August 15, 2025 +- Participants: Physicians registered with the Pakistan Medical and Dental Council (MBBS degrees), participating in-person or via remote video +- All participants completed **20-hour AI-literacy training** covering LLM capabilities, prompt engineering, and critical evaluation of AI output +- Randomized 1:1: 6 clinical vignettes, 75-minute session +- **Control arm:** Received correct ChatGPT-4o recommendations +- **Treatment arm:** Received recommendations with **deliberate errors in 3 of 6 vignettes** + +**Key results:** +- Erroneous LLM recommendations **significantly degraded physicians' diagnostic accuracy** in the treatment arm +- This effect occurred even among **AI-trained physicians** (20 hours of AI-literacy training) +- "Voluntary deference to flawed AI output highlights critical patient safety risk" +- "Necessitating robust safeguards to ensure human oversight before widespread clinical deployment" + +Related work: JAMA Network Open "LLM Influence on Diagnostic Reasoning" randomized clinical trial (June 2025, PMID: 2825395). ClinicalTrials.gov NCT07328815: "Mitigating Automation Bias in Physician-LLM Diagnostic Reasoning Using Behavioral Nudges" — a follow-on study specifically testing behavioral interventions to reduce automation bias. + +Meta-analysis on LLM effect on diagnostic accuracy (medRxiv December 2025) synthesizing these trials. + +## Agent Notes +**Why this matters:** The centaur model — AI for pattern recognition, physicians for judgment — is Belief 5's proposed solution to clinical AI safety risks. 
This RCT directly challenges the centaur assumption: if 20 hours of AI-literacy training is insufficient to protect physicians from automation bias when AI gives DELIBERATELY wrong answers, then the "physician oversight catches AI errors" safety mechanism is much weaker than assumed. The physicians in this study were trained to critically evaluate AI output and still failed. + +**What surprised me:** The training duration (20 hours) is substantial — most "AI literacy" programs are far shorter. If 20 hours doesn't prevent automation bias against deliberately erroneous AI, shorter or no training almost certainly doesn't either. Also noteworthy: the emergence of NCT07328815 (follow-on trial testing "behavioral nudges" to mitigate automation bias) suggests the field recognizes the problem and is actively searching for solutions — which itself confirms the problem's existence. + +**What I expected but didn't find:** I expected to see some granularity on WHICH types of clinical errors triggered the most automation bias. The summary doesn't specify — this is a gap in the current KB for understanding when automation bias is highest-risk. + +**KB connections:** +- Directly challenges the "centaur model" safety assumption in Belief 5 +- Connects to Session 19 finding (Catalini verification bandwidth): verification bandwidth is even more constrained if automation bias reduces the quality of physician review +- Cross-domain: connects to Theseus's alignment work on human oversight robustness — this is a domain-specific instance of the general problem of humans failing to catch AI errors at scale + +**Extraction hints:** Primary claim: AI-literacy training is insufficient to prevent automation bias in physician-LLM diagnostic settings (RCT evidence). Secondary: the existence of NCT07328815 ("Behavioral Nudges to Mitigate Automation Bias") as evidence that the field has recognized the problem and is searching for solutions. + +**Context:** Published during a period of rapid clinical AI deployment. The Pakistan physician cohort may limit generalizability, but the automation bias effect is directionally consistent with US and European literature. The NCT07328815 follow-on study suggests US-based researchers are testing interventions — that trial results will be high KB value when available. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: "clinical AI augments physicians but creates novel safety risks requiring centaur design" (Belief 5's centaur assumption) +WHY ARCHIVED: First RCT showing that even AI-trained physicians fail to catch erroneous AI recommendations — the centaur model's "physician catches errors" safety assumption is empirically weaker than stated +EXTRACTION HINT: Extract the automation-bias-despite-AI-training finding as a challenge to the centaur design assumption. Note the follow-on NCT07328815 trial as evidence the field recognizes the problem requires specific intervention. 
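Addendum (analysis sketch): the trial's primary contrast reduces to comparing diagnostic accuracy between arms. A minimal sketch with hypothetical placeholder counts (the preprint's data are not reproduced here); the naive two-proportion z-test below also ignores clustering by physician, which a real analysis would handle.

```python
# Minimal two-arm contrast for an automation-bias RCT: diagnostic accuracy in
# a control arm (correct LLM recommendations) vs. a treatment arm (deliberate
# errors planted in 3 of 6 vignettes). Counts are HYPOTHETICAL placeholders;
# this naive two-proportion z-test ignores clustering by physician.
from statistics import NormalDist

control_correct, control_n = 210, 300      # placeholder vignette-level counts
treatment_correct, treatment_n = 150, 300  # placeholder vignette-level counts

p_control = control_correct / control_n
p_treatment = treatment_correct / treatment_n
pooled = (control_correct + treatment_correct) / (control_n + treatment_n)
se = (pooled * (1 - pooled) * (1 / control_n + 1 / treatment_n)) ** 0.5
z = (p_control - p_treatment) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

print(f"accuracy: control {p_control:.2f} vs treatment {p_treatment:.2f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```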
diff --git a/inbox/queue/2026-03-22-cognitive-bias-clinical-llm-npj-digital-medicine.md b/inbox/queue/2026-03-22-cognitive-bias-clinical-llm-npj-digital-medicine.md new file mode 100644 index 00000000..f49a8a47 --- /dev/null +++ b/inbox/queue/2026-03-22-cognitive-bias-clinical-llm-npj-digital-medicine.md @@ -0,0 +1,62 @@ +--- +type: source +title: "Cognitive Bias in Clinical Large Language Models (npj Digital Medicine, 2025)" +author: "npj Digital Medicine research team" +url: https://www.nature.com/articles/s41746-025-01790-0 +date: 2025-01-01 +domain: health +secondary_domains: [ai-alignment] +format: research paper +status: unprocessed +priority: medium +tags: [cognitive-bias, llm, clinical-ai, anchoring-bias, framing-bias, automation-bias, confirmation-bias, npj-digital-medicine] +--- + +## Content + +Published in npj Digital Medicine (2025, PMC12246145). The paper provides a taxonomy of cognitive biases that LLMs inherit and potentially amplify in clinical settings. + +**Key cognitive biases documented:** + +**Anchoring bias:** +- LLMs can anchor on early input data for subsequent reasoning +- GPT-4 study: incorrect initial diagnoses "consistently influenced later reasoning" until a structured multi-agent setup challenged the anchor +- This is distinct from human anchoring: LLMs may be MORE susceptible because they process information sequentially with strong early-context weighting + +**Framing bias:** +- GPT-4 diagnostic accuracy declined when clinical cases were reframed with "disruptive behaviors or other salient but irrelevant details" +- Mirrors human framing effects — but LLMs may amplify them because they lack the contextual resistance that experienced clinicians develop + +**Confirmation bias:** +- LLMs show confirmation bias (seeking evidence supporting initial assessment over evidence against it) +- "Cognitive biases such as confirmation bias, anchoring, overconfidence, and availability significantly influence clinical judgment" + +**Automation bias (cross-reference):** +- The paper frames automation bias as a major deployment-level risk: clinicians favor AI suggestions even when incorrect +- Confirmed by the separate NCT06963957 RCT (medRxiv August 2025) + +**Related:** A second paper, "Evaluation and Mitigation of Cognitive Biases in Medical Language Models" (npj Digital Medicine 2024, PMC11494053) provides mitigation frameworks. The framing of LLMs as amplifying (not just replicating) human cognitive biases is the key insight. + +**ClinicalTrials.gov NCT07328815:** "Mitigating Automation Bias in Physician-LLM Diagnostic Reasoning Using Behavioral Nudges" — a registered trial specifically designed to test whether behavioral nudges can reduce automation bias in physician-LLM workflows. + +## Agent Notes +**Why this matters:** If LLMs exhibit anchoring, framing, and confirmation biases — the same biases that cause human clinical errors — then deploying LLMs in clinical settings doesn't introduce NEW cognitive failure modes, it AMPLIFIES existing ones. This is more dangerous than the simple "AI hallucinates" framing because: (1) it's harder to detect (the errors look like clinical judgment errors, not obvious AI errors); (2) automation bias makes physicians trust AI confirmation of their own cognitive biases; (3) at scale (OE: 30M/month), the amplification is population-wide. + +**What surprised me:** The GPT-4 anchoring study (incorrect initial diagnoses influencing all later reasoning) is more extreme than I expected. 
If a physician asks OE a question with a built-in assumption (anchoring framing), OE confirms that frame rather than challenging it — this is the CONFIRMATION side of the reinforcement mechanism, and it works differently from the "OE reinforces existing plans" finding: here the object being confirmed is the physician's framing of the question, not the plan itself.

**What I expected but didn't find:** Quantification of how much LLMs amplify vs. replicate human cognitive biases. The paper describes the mechanisms but doesn't provide a systematic "amplification factor" — this is a gap in the evidence base.

**KB connections:**
- Extends Belief 5 (clinical AI safety) with a cognitive architecture explanation for WHY clinical AI creates novel risks
- The anchoring finding directly explains OE's "reinforces plans" mechanism: if the physician's plan is the anchor, OE confirms the anchor rather than challenging it
- The framing bias finding connects to the sociodemographic bias study — demographic labels are a form of framing, and LLMs respond to framing in clinically significant ways
- Cross-domain: connects to Theseus's alignment work on how training objectives may encode human cognitive biases

**Extraction hints:** Extract the LLM anchoring finding (GPT-4 incorrect initial diagnoses propagating through reasoning) as a specific mechanism claim. The framing bias finding (demographic labels as clinically irrelevant but decision-influencing framing) bridges the cognitive bias and sociodemographic bias literature.

**Context:** This is a framework paper, not a large empirical study. Its value is in providing conceptual scaffolding for the empirical findings (Nature Medicine sociodemographic bias, NOHARM). The paper helps explain WHY the empirical patterns occur, not just THAT they occur.

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: "clinical AI augments physicians but creates novel safety risks requiring centaur design" (Belief 5)
WHY ARCHIVED: Provides cognitive mechanism explanation for why "reinforcement" is dangerous — LLM anchoring + confirmation bias means OE reinforces the physician's initial (potentially biased) frame, not the correct frame
EXTRACTION HINT: The amplification framing is the key claim to extract: LLMs don't just replicate human cognitive biases, they may amplify them by confirming anchored/framed clinical assessments without the contextual resistance of experienced clinicians.
diff --git a/inbox/queue/2026-03-22-health-canada-rejects-dr-reddys-semaglutide.md b/inbox/queue/2026-03-22-health-canada-rejects-dr-reddys-semaglutide.md
new file mode 100644
index 00000000..ad548a03
--- /dev/null
+++ b/inbox/queue/2026-03-22-health-canada-rejects-dr-reddys-semaglutide.md
@@ -0,0 +1,53 @@
---
type: source
title: "Health Canada Rejects Dr. Reddy's Generic Semaglutide Application — Canada Launch Delayed to 2027 at Earliest"
author: "Business Standard / The Globe and Mail"
url: https://www.business-standard.com/companies/news/dr-reddys-labs-semaglutide-generic-canada-approval-delay-125103001103_1.html
date: 2025-10-30
domain: health
secondary_domains: []
format: news article
status: unprocessed
priority: high
tags: [semaglutide-generics, glp1, dr-reddys, health-canada, canada, regulatory, patent-cliff, obeda]
---

## Content

**Business Standard (October 2025):** Dr. Reddy's timeline to launch generic injectable semaglutide in Canada was set to be disrupted after the firm received a non-compliance notice (NoN) from Canada's Pharmaceutical Drugs Directorate. The notice could delay the launch by at least 8-12 months.
+ +**The Globe and Mail (subsequent coverage):** Health Canada rejected Dr. Reddy's Laboratories' application to make generic semaglutide — a setback for what was poised to be one of the first generic competitors to Ozempic to hit the market in 2026. + +**Company response:** Dr. Reddy's stated it is "in constant touch with Canadian regulators" and has "sent replies to their queries." The Canada launch is "on pause." + +**India launch confirmed:** Separately, Dr. Reddy's launched "Obeda" (generic semaglutide for Type 2 diabetes) in India — this is confirmed from the March 21, 2026 India generic market launch (Session 9 findings). + +**Context:** +- Canada's semaglutide patents expired January 2026 +- Dr. Reddy's was projecting May 2026 Canada launch in its 87-country rollout plan +- Multiple legal/patent complications in Canada (Pearce IP analysis, patentlawyermagazine.com coverage on "semaglutide saga" in Canada) +- Timeline: if re-submitted immediately after rejection, 8-12 months for new review = June-October 2026 re-submission → 2027 at earliest for approval + +**Session 9 error:** The March 21, 2026 research session projected Dr. Reddy's Canada May 2026 launch as a near-term confirmed data point. This was incorrect — the Health Canada rejection means no Canada data in 2026. + +## Agent Notes +**Why this matters:** Canada was the single clearest near-term data point for what generic semaglutide looks like in a major, high-income market with a functioning generic drug approval system. India's Day-1 pricing ($15-55/month) established the floor for low-income markets. Canada would have established the floor for high-income markets with similar health infrastructure to the US. That data point is now delayed to 2027 at earliest. + +**What surprised me:** The Health Canada rejection was not anticipated in any of the bullish GLP-1 generic coverage. The India launch coverage (Sessions 8-9) projected smooth Canada entry given the January 2026 patent expiration. The regulatory rejection is a material setback to the "generic access within 12 months of patent expiry" narrative. + +**What I expected but didn't find:** An explanation of what specifically was non-compliant in Dr. Reddy's submission. The Business Standard coverage doesn't specify the technical grounds — whether it's manufacturing quality, bioequivalence data, device design, or another issue. This matters because different rejection reasons have different remediation timelines. + +**KB connections:** +- Directly updates Session 9 finding (Canada May 2026 launch was a key thread — now confirmed delayed) +- Recalibrates the GLP-1 global generic rollout timeline: India confirmed, Canada 2027+, Brazil/Turkey TBD +- The "US gray market importation" thread (Sessions 8-9): Canada was expected to be the primary source of legal/gray market US importation. That channel is now delayed. +- The GLP-1 KB claim update ("inflationary through 2035" → split by market): the Canada delay means international price data for high-income markets is further away than projected + +**Extraction hints:** The primary claim is a timeline correction: Canada generic semaglutide launch is 2027 at earliest (not 2026 as the global rollout narrative projected). The secondary claim is about regulatory friction as a barrier to generic market entry that the India-first narrative didn't adequately account for. + +**Context:** This source corrects a material error in Session 9. The May 2026 Canada launch was listed as a key active thread and near-term data point. 
That thread is now effectively closed until 2027. The India price data remains the only live data point for post-patent generic semaglutide markets. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: GLP-1 receptor agonists claim ("inflationary through 2035") and the Session 21 claim candidate about Dr. Reddy's 87-country rollout +WHY ARCHIVED: Corrects the Session 9 projection; establishes regulatory friction as an underappreciated barrier to generic GLP-1 global rollout +EXTRACTION HINT: The claim candidate from Session 9 about Dr. Reddy's clearing 87 countries for 2026 rollout needs updating — Canada is NOT in the 2026 timeline. The extractor should flag this as a correction to Session 9's claim candidate 2. diff --git a/inbox/queue/2026-03-22-nature-medicine-llm-sociodemographic-bias.md b/inbox/queue/2026-03-22-nature-medicine-llm-sociodemographic-bias.md new file mode 100644 index 00000000..b212e9ef --- /dev/null +++ b/inbox/queue/2026-03-22-nature-medicine-llm-sociodemographic-bias.md @@ -0,0 +1,56 @@ +--- +type: source +title: "Sociodemographic Biases in Medical Decision Making by Large Language Models (Nature Medicine, 2025)" +author: "Nature Medicine / Multi-institution research team" +url: https://www.nature.com/articles/s41591-025-03626-6 +date: 2025-01-01 +domain: health +secondary_domains: [ai-alignment] +format: research paper +status: unprocessed +priority: high +tags: [llm-bias, sociodemographic-bias, clinical-ai-safety, race-bias, income-bias, lgbtq-bias, health-equity, medical-ai, nature-medicine] +--- + +## Content + +Published in Nature Medicine (2025, PubMed 40195448). The study evaluated nine LLMs, analyzing over **1.7 million model-generated outputs** from 1,000 emergency department cases (500 real, 500 synthetic). Each case was presented in **32 sociodemographic variations** — 31 sociodemographic groups plus a control — while holding all clinical details constant. + +**Key findings:** + +**Race/Housing/LGBTQIA+ bias:** +- Cases labeled as Black, unhoused, or identifying as LGBTQIA+ were more frequently directed toward urgent care, invasive interventions, or mental health evaluations +- LGBTQIA+ subgroups: mental health assessments recommended **approximately 6-7 times more often than clinically indicated** +- Bias magnitude "not supported by clinical reasoning or guidelines" — model-driven, not acceptable clinical variation + +**Income bias:** +- High-income cases: significantly more recommendations for advanced imaging (CT/MRI, P < 0.001) +- Low/middle-income cases: often limited to basic or no further testing + +**Universality:** +- Bias found in **both proprietary AND open-source models** — not an artifact of any single system +- The authors note this pattern "could eventually lead to health disparities" + +Coverage: Nature Medicine, PubMed, Inside Precision Medicine (ChatBIAS study coverage), UCSF Coordinating Center for Diagnostic Excellence, Conexiant. + +## Agent Notes +**Why this matters:** This is the first large-scale (1.7M outputs, 9 models) empirical documentation of systematic sociodemographic bias in LLM clinical recommendations. The finding that bias appears in all models — proprietary and open-source — makes this a structural problem with LLM-assisted clinical AI, not a fixable artifact of one system. Critically, OpenEvidence is built on these same model classes. 
If OE "reinforces physician plans," and those plans already contain demographic biases (which physician behavior research shows they do), OE amplifies those biases at 30M+ monthly consultations. + +**What surprised me:** The LGBTQIA+ mental health referral rate (6-7x clinically indicated) is far more extreme than I expected from demographic framing effects. Also surprising: the income bias appears in imaging access — this suggests models are reproducing healthcare rationing patterns based on perceived socioeconomic status, not clinical need. + +**What I expected but didn't find:** I expected some models to be clearly better on bias metrics than others. The finding that bias is consistent across proprietary and open-source models suggests this is a training data / RLHF problem, not an architecture problem. + +**KB connections:** +- Extends Belief 5 (clinical AI safety) with specific failure mechanism: demographic bias amplification +- Connects to Belief 2 (social determinants) — LLMs may be worsening rather than reducing SDOH-driven disparities +- Challenges AI health equity narratives (AI reduces disparities) common in VBC/payer discourse +- Cross-domain: connects to Theseus's alignment work on training data bias and RLHF feedback loops + +**Extraction hints:** Extract as two claims: (1) systematic demographic bias in LLM clinical recommendations across all model types; (2) the specific mechanism — bias appears when demographic framing is added to otherwise identical cases, suggesting training data reflects historical healthcare inequities. + +**Context:** Published 2025 in Nature Medicine, widely covered. Part of a growing body (npj Digital Medicine cognitive bias paper, PLOS Digital Health) documenting the gap between LLM benchmark performance and real-world demographic equity. The study is directly relevant to US regulatory discussions about AI health equity requirements. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: "clinical AI augments physicians but creates novel safety risks requiring centaur design" (Belief 5 supporting claim) +WHY ARCHIVED: First large-scale empirical proof that LLM clinical AI has systematic sociodemographic bias, found across all model types — this makes the "OE reinforces plans" safety concern concrete and quantifiable +EXTRACTION HINT: Extract the demographic bias finding as its own claim, separate from the general "clinical AI safety" framing. The 6-7x LGBTQIA+ mental health referral rate and income-driven imaging disparity are specific enough to disagree with and verify. 
diff --git a/inbox/queue/2026-03-22-obbba-medicaid-work-requirements-state-implementation.md b/inbox/queue/2026-03-22-obbba-medicaid-work-requirements-state-implementation.md new file mode 100644 index 00000000..da50d327 --- /dev/null +++ b/inbox/queue/2026-03-22-obbba-medicaid-work-requirements-state-implementation.md @@ -0,0 +1,62 @@ +--- +type: source +title: "OBBBA Medicaid Work Requirements: State Implementation Status as of January 2026" +author: "Ballotpedia News / Georgetown CCF / Aurrera Health Group" +url: https://news.ballotpedia.org/2026/01/23/mandatory-medicaid-work-requirements-are-coming-what-do-they-look-like-now/ +date: 2026-01-23 +domain: health +secondary_domains: [] +format: policy analysis +status: unprocessed +priority: medium +tags: [obbba, medicaid, work-requirements, state-implementation, coverage-fragmentation, vbc, january-2027, section-1115-waivers, nebraska] +--- + +## Content + +**Ballotpedia News (January 23, 2026):** Comprehensive update on OBBBA work requirements implementation status as of January 23, 2026. + +**Mandatory timeline:** +- **January 1, 2027:** All states must implement 80 hours/month work requirements for able-bodied Medicaid recipients in the ACA expansion group +- Session 9 note: Timeline was stated as "December 31, 2026" — the correct date is January 1, 2027 (minor correction) + +**Early implementation (Section 1115 waivers):** +- The OBBBA allows states to apply for Section 1115 waivers to implement work requirements BEFORE the January 2027 mandatory deadline +- BUT: Section 1115 waivers CANNOT be used to WAIVE the work requirements — only to implement them earlier +- As of January 23, 2026: **all 7 state waiver applications are still pending at CMS** + - Arizona, Arkansas, Iowa, Montana, Ohio, South Carolina, Utah +- Nebraska: announced intention to implement via state plan amendment (no waiver needed), ahead of schedule + +**Historical precedent:** +- Only 2 states had ever implemented Medicaid work requirements prior to OBBBA +- Georgia: implemented July 1, 2023; requirements still in effect — the only working precedent +- Georgia's implementation under Section 1115 waiver was successfully defended in court + +**Georgetown CCF context:** Work requirements, provider tax restrictions, and frequent redeterminations are distinct mechanisms within OBBBA, each with different implementation timelines. The CHW funding impact (provider tax freeze) is already in effect; work requirements are the delayed mechanism. + +**AMA analysis (ama-assn.org):** Provides detailed breakdown of OBBBA healthcare provisions, confirms work requirement structure. + +**What this means for VBC/Belief 3:** +The VBC continuous-enrollment disruption mechanism (Session 8 finding) is structural but its observable impact is 12+ months away. The 10 million uninsured CBO projection runs to 2034; first enrollment disruption data will appear in 2027. The provider tax freeze (already in effect) is the mechanism creating immediate CHW program funding pressure. + +## Agent Notes +**Why this matters:** Session 8 established OBBBA as the most consequential healthcare policy event since Medicaid's creation. But the implementation timeline means the KB's claim about VBC enrollment disruption is a structural claim about future conditions, not an observable fact yet. This source clarifies the timeline: July 2027 is the earliest we are likely to see real-world work requirement effects in Medicaid enrollment data, allowing roughly six months of lag after the January 1, 2027 mandatory start.
The 7 pending state waivers (all still pending in January 2026) mean even the "early implementers" haven't started. + +**What surprised me:** All 7 state waivers are still pending — none have been approved. Given the July 4, 2025 signing date, 6+ months of CMS inaction on state waiver requests is longer than expected. This could mean CMS is using administrative delay as resistance, or that the waivers have technical compliance issues. + +**What I expected but didn't find:** Any indication of which state is closest to CMS approval for early implementation. The Ballotpedia source doesn't distinguish among the 7 pending states by proximity to approval. + +**KB connections:** +- Updates Session 8 finding (OBBBA as VBC enrollment disruption mechanism) with specific implementation timeline +- The CHW funding impact (provider tax freeze) is already in effect — this is the more immediate mechanism +- Connects to Belief 3 (structural misalignment): the political economy headwind is real but its observable effects are 12+ months out +- The Georgia precedent (implemented July 2023, still in effect) is the only real-world data on work requirement effects — worth monitoring as a harbinger of 2027 national effects + +**Extraction hints:** Primary claim: OBBBA work requirements are mandatory January 1, 2027, but as of January 2026, all state waiver applications are pending and no early implementations have begun (Nebraska has announced, but not yet begun, early implementation via state plan amendment). Secondary: the distinction between already-in-effect provisions (provider tax freeze, CHW funding constraints) and future-effect provisions (work requirements, enrollment disruption) is important for KB temporal accuracy. (A minimal temporal-scoping sketch follows the curator notes below.) + +**Context:** This source is primarily valuable as a timeline clarification and status update for the Session 8 OBBBA analysis. The structural finding (VBC enrollment disruption mechanism) is unchanged. The observable impact is 2027+. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: Session 8 OBBBA claim candidates on VBC enrollment disruption and CHW program blocking +WHY ARCHIVED: Provides current implementation status — clarifies that work requirement effects are 2027+ observable, not 2026; helps scope temporal accuracy of KB claims +EXTRACTION HINT: The CHW/provider tax freeze (already in effect) and work requirements (January 1, 2027) should be extracted as two separate claims with different temporal scopes. Current Session 8 claim candidates may conflate them.
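A minimal temporal-scoping sketch (hypothetical KB structure; the field names and the freeze's effective date are my assumptions, flagged in the comments). The point is that the two provision classes should carry different effective dates so a "what is observable now?" query cannot conflate them:

```python
from datetime import date

# Hypothetical structure for temporally scoped claims; dates are from the
# Ballotpedia/Georgetown CCF summary above. The provider tax freeze is
# described only as "already in effect," so the OBBBA signing date
# (July 4, 2025) is used as a placeholder.
obbba_provisions = [
    {
        "claim": "provider tax freeze constrains CHW program funding",
        "effective": date(2025, 7, 4),
    },
    {
        "claim": "80 hours/month work requirements disrupt Medicaid enrollment",
        "effective": date(2027, 1, 1),
    },
]

def observable_claims(provisions, as_of):
    """Claims whose effects could already be observed at a given date."""
    return [p["claim"] for p in provisions if p["effective"] <= as_of]

print(observable_claims(obbba_provisions, date(2026, 1, 23)))
# ['provider tax freeze constrains CHW program funding']
```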
diff --git a/inbox/queue/2026-03-22-openevidence-sutter-health-epic-integration.md b/inbox/queue/2026-03-22-openevidence-sutter-health-epic-integration.md new file mode 100644 index 00000000..64f4f321 --- /dev/null +++ b/inbox/queue/2026-03-22-openevidence-sutter-health-epic-integration.md @@ -0,0 +1,58 @@ +--- +type: source +title: "OpenEvidence Embeds in Epic EHR at Sutter Health (February 2026)" +author: "BusinessWire / OpenEvidence / Sutter Health" +url: https://www.businesswire.com/news/home/20260211318919/en/Sutter-Health-Collaborates-with-OpenEvidence-to-Bring-Evidence-Based-AI-Powered-Insights-into-Physician-Workflows +date: 2026-02-11 +domain: health +secondary_domains: [ai-alignment] +format: press release +status: unprocessed +priority: medium +tags: [openevidence, sutter-health, epic-ehr, clinical-ai, ehr-integration, workflow-ai, automation-bias, california] +--- + +## Content + +Announced February 11, 2026: Sutter Health (one of California's largest health systems, roughly 12,000 affiliated physicians) has entered a collaboration with OpenEvidence to embed AI-powered clinical decision support within Epic EHR workflows. + +**Key details:** +- OE will be integrated within Epic's electronic health record system at Sutter Health +- Enables natural-language search for guidelines, peer-reviewed studies, and clinical evidence within the EHR +- Physicians can access OE during clinical workflow without opening a separate application +- Stated goal: "advance healthcare sustainability and medical AI safety" +- Sutter Health: 30 hospitals, 900+ care centers, ~12,000 affiliated physicians in California + +**Context from other sources:** +- BusinessWire announcement (February 11, 2026); Healthcare IT News; HLTH platform coverage +- Sutter Health is described as having "high standards for quality, safety and patient-centered care" +- No mention of prospective outcomes study or safety evaluation pre-deployment +- The partnership announcement coincides with OE being cited in the ARISE State of Clinical AI 2026 as a "consumer-facing" tool used to bypass institutional IT + +**Previously:** OE was primarily used as a standalone app — physicians opened it separately from their EHR. The Sutter integration makes OE a native in-workflow tool. + +## Agent Notes
**Why this matters:** This is a structural shift in how OE's safety risk profile operates. A tool used as a voluntary external lookup has different automation bias dynamics than a tool embedded in the clinical workflow. Research on in-context vs. external AI consistently shows in-context suggestions generate higher adherence. The Sutter integration essentially institutionalizes the "safety paradox" that ARISE identified — instead of physicians bypassing institutional governance to use OE, Sutter's institutional governance IS OE. + +**What surprised me:** The absence of any mention of pre-deployment safety evaluation. Given that: +- The NOHARM study found severe errors in roughly 12-15% of cases even for its best-performing LLMs, and in up to 22% of cases across all 31 models (findings released January 2026) +- The Nature Medicine bias study documented systematic demographic bias across all models (2025) +- OE has zero prospective clinical outcomes evidence +...it is notable that a major health system is embedding OE in primary clinical workflows without mentioning a formal safety evaluation. This is the scale-safety asymmetry at its most acute.
+ +**What I expected but didn't find:** Any mention of how OE's model was selected, what safety benchmarks were reviewed, whether OE was evaluated against NOHARM or similar frameworks before deployment, or what clinical governance oversight Sutter has put in place for in-EHR AI. + +**KB connections:** +- Extends Session 9 finding on OE scale-safety asymmetry (now at health-system EHR level) +- Connects to Session 8 (Catalini verification bandwidth) — in-EHR suggestions at physician workflow speed make verification even harder +- ARISE "safety paradox" framing applies directly: this integration institutionalizes the workaround +- If OE has the sociodemographic biases documented in the Nature Medicine study, those biases are now embedded in Sutter's clinical workflows + +**Extraction hints:** The primary claim is structural: EHR embedding of clinical AI with zero prospective outcomes evidence creates a different (higher) automation bias risk profile than standalone app use. The absence of safety evaluation documentation before deployment is itself a finding about governance gaps. + +**Context:** Sutter Health is a major California health system that serves approximately 3.3 million patients annually. Its physician count (~12,000 affiliated) means the OE-Epic integration could affect millions of patient encounters annually. This is not a pilot — it's a full health-system deployment. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: Session 9 finding on OpenEvidence scale (30M+ monthly consultations, valuation-evidence asymmetry) +WHY ARCHIVED: First major EHR integration of OE — changes the automation bias risk profile from standalone app to in-workflow embedded tool; no safety evaluation mentioned pre-deployment +EXTRACTION HINT: Focus on the governance gap: EHR embedding without prospective safety validation. This is a structural claim about how health system procurement decisions interact with clinical AI safety evidence requirements. diff --git a/inbox/queue/2026-03-22-stanford-harvard-noharm-clinical-llm-safety.md b/inbox/queue/2026-03-22-stanford-harvard-noharm-clinical-llm-safety.md new file mode 100644 index 00000000..c53a55ac --- /dev/null +++ b/inbox/queue/2026-03-22-stanford-harvard-noharm-clinical-llm-safety.md @@ -0,0 +1,51 @@ +--- +type: source +title: "First, Do NOHARM: Towards Clinically Safe Large Language Models (Stanford/Harvard, January 2026)" +author: "Stanford/Harvard ARISE Research Network" +url: https://arxiv.org/abs/2512.01241 +date: 2026-01-02 +domain: health +secondary_domains: [ai-alignment] +format: research paper +status: unprocessed +priority: high +tags: [clinical-ai-safety, llm-errors, omission-bias, noharm-benchmark, stanford, harvard, clinical-benchmarks, medical-ai] +--- + +## Content + +The NOHARM study ("First, Do NOHARM: Towards Clinically Safe Large Language Models") evaluated 31 large language models against 100 real primary care consultation cases spanning 10 medical specialties. Clinical cases were drawn from 16,399 real electronic consultations at Stanford Health Care, with 12,747 expert annotations for 4,249 clinical management options.
+ +**Core findings:** +- Severe harm in up to **22.2% of cases** (95% CI 21.6-22.8%) across 31 tested LLMs +- **Harms of omission account for 76.6% (95% CI 76.4-76.8%) of all severe errors** — missing necessary actions rather than recommending wrong ones +- Best performers (Gemini 2.5 Flash, LiSA 1.0): 11.8-14.6 severe errors per 100 cases +- Worst performers (o4 mini, GPT-4o mini): 39.9-40.1 severe errors per 100 cases +- Safety performance only moderately correlated with existing AI/medical benchmarks (r = 0.61-0.64) — **USMLE scores do not predict clinical safety** +- Best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%) +- Multi-agent approach reduces harm vs. solo model (mean difference 8.0%, 95% CI 4.0-12.1%) + +Published to arxiv December 2025 (2512.01241). Findings reported by Stanford Medicine January 2, 2026. Referenced in the Stanford-Harvard State of Clinical AI 2026 report. + +Related coverage: ppc.land, allhealthtech.com + +## Agent Notes +**Why this matters:** The NOHARM study is the most rigorous clinical AI safety evaluation to date, testing actual clinical cases (not exam questions) from a real health system, with 12,747 expert annotations. The 76.6% omission finding is the most important number: it means the dominant clinical AI failure is not "AI says wrong thing" but "AI fails to mention necessary thing." This directly reframes the OpenEvidence "reinforces plans" finding as dangerous — if OE confirms a plan containing an omission (the most common error type), it makes that omission more entrenched. + +**What surprised me:** Two surprises: (1) The omission percentage is much higher than commissions — this is counterintuitive because AI safety discussions focus on hallucinations (commissions). (2) Best models actually outperform generalist physicians on safety (9.7% improvement) — this means clinical AI at its best IS safer than the human baseline, which complicates simple "AI is dangerous" framings. The question becomes: does OE use best-in-class models? OE has never disclosed its architecture or safety benchmarks. + +**What I expected but didn't find:** I expected more data on how often physicians override AI recommendations when errors occur. The NOHARM study doesn't include physician-AI interaction data — it only tests AI responses, not physician behavior in response to AI. + +**KB connections:** +- Directly extends Belief 5 (clinical AI safety risks) with a specific error taxonomy (omission-dominant) +- Challenges the "centaur model catches errors" assumption — if errors are omissions, physician oversight doesn't activate because the physician doesn't know what's missing +- Standard benchmarks (USMLE) do not correlate well with safety performance — challenges OpenEvidence's benchmark-based safety claims + +**Extraction hints:** The omission/commission distinction is the primary extractable claim. Secondary: benchmark performance does not predict clinical safety (this challenges OE's marketing of its USMLE 100% score as evidence of safety). Tertiary: best models outperform physicians — this is the nuance that prevents simple "AI is bad" claims. + +**Context:** Published in December 2025, findings widely covered January 2026. Referenced in the Stanford-Harvard ARISE State of Clinical AI 2026 report. The NOHARM benchmark (100 primary care cases, 31 models, 10 specialties) is likely to become a standard evaluation framework for clinical AI.
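A worked check on two of the numbers above (my arithmetic, not analysis from the paper): squaring the benchmark-safety correlation shows how weak a proxy benchmark scores are, and the annotation counts are consistent with exactly three expert annotations per management option.

```python
# Reported correlation range between benchmark scores and NOHARM safety.
for r in (0.61, 0.64):
    print(f"r = {r:.2f} -> r^2 = {r * r:.2f}")
# r = 0.61 -> r^2 = 0.37
# r = 0.64 -> r^2 = 0.41
# Benchmark performance explains only ~37-41% of the variance in safety
# performance, leaving roughly 60% unexplained by USMLE-style scores.

# Annotation density from the study description above:
print(12_747 / 4_249)  # 3.0 -- three expert annotations per option
```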
+ +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: "clinical AI augments physicians but creates novel safety risks requiring centaur design" (Belief 5 supporting claim) +WHY ARCHIVED: Defines the dominant clinical AI failure mode (omission vs. commission) — directly reframes the risk profile of tools like OpenEvidence +EXTRACTION HINT: Focus on the 76.6% omission figure and its interaction with OE's "reinforces plans" mechanism. Also extract the benchmark-safety correlation gap (r = 0.61-0.64) as a second claim challenging USMLE-based safety marketing.
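A closing toy model of the Direction B mechanism (every parameter except the 30M monthly consultation figure, which is the KB's own OE scale estimate, is an assumption invented for illustration). It compares expected uncaught omissions with and without a plan-confirming tool:

```python
# Toy model of omission amplification under plan-confirming AI.
# All rates below are assumptions for illustration, not measured values.
consultations_per_month = 30_000_000  # OE's reported monthly volume
p_plan_has_omission = 0.10         # assumed share of plans missing a necessary action
p_catch_unaided = 0.30             # assumed rate at which physicians later catch it alone
p_catch_after_confirmation = 0.15  # assumed lower catch rate once AI confirms the plan
                                   # (automation-bias literature suggests confirmation
                                   #  reduces re-checking; the exact size is unknown)

def uncaught_omissions(n, p_omit, p_catch):
    """Expected consultations per month whose omission is never caught."""
    return n * p_omit * (1 - p_catch)

baseline = uncaught_omissions(consultations_per_month, p_plan_has_omission, p_catch_unaided)
with_ai = uncaught_omissions(consultations_per_month, p_plan_has_omission, p_catch_after_confirmation)
print(f"baseline: {baseline:,.0f}/month, with confirming AI: {with_ai:,.0f}/month")
# baseline: 2,100,000/month, with confirming AI: 2,550,000/month
```

The sketch is directional, not predictive: even a modest assumed drop in self-correction, multiplied across OE's volume, yields hundreds of thousands of additional uncaught omissions per month, which is why the omission-dominant error taxonomy matters more than any hallucination rate.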