diff --git a/agents/vida/musings/research-2026-03-24.md b/agents/vida/musings/research-2026-03-24.md new file mode 100644 index 00000000..08ebfa61 --- /dev/null +++ b/agents/vida/musings/research-2026-03-24.md @@ -0,0 +1,222 @@ +--- +status: developed +type: musing +stage: complete +created: 2026-03-24 +last_updated: 2026-03-24 +tags: [clinical-ai-safety, nhs-dtac, eu-ai-act, regulatory-compliance, openevidence, belief-5-disconfirmation, belief-1-disconfirmation, deaths-of-despair, healthspan, pnas-cohort-mortality, real-world-deployment-gap, centaur-model, pharmacist-copilot, lords-inquiry, obbba, glp1-digital] +--- + +# Research Session 12: Keystone Belief Confirmed and Strengthened; Regulatory Track Clarified; Fifth Clinical AI Failure Mode + +## Research Question + +**Are clinical AI companies actually preparing for NHS DTAC V2 (April 6, 2026) and EU AI Act (August 2026) — and does emerging regulatory compliance behavior represent the first observable closing of the commercial-research gap? Secondary: what does new evidence say about deaths of despair and US life expectancy (Belief 1 disconfirmation attempt)?** + +## Why This Question + +Two concurrent targets: + +**Thread A (primary — regulatory track from Session 11):** The NHS DTAC V2 April 6 deadline was framed in Session 11 as a major compliance moment. Session 12 tested whether this was substantive. Secondary: does the NHS supplier registry (19 vendors, January 2026) represent the actual compliance mechanism? + +**Thread B (Belief 1 disconfirmation):** Belief 1 hasn't been targeted since Session 7 (March 19). The CDC's +0.6 year LE improvement in 2024 represents the strongest surface-level evidence against the "compounding failure" thesis. Can it be used to challenge the keystone belief? + +**Disconfirmation targets:** +- Belief 5: Does emerging regulatory compliance or the pharmacist+LLM co-pilot evidence undermine the pessimistic clinical AI safety reading? 
+- Belief 1: Does the 2024 US LE recovery to 79.0 years, or any new deaths of despair data, suggest self-correction in the healthspan binding constraint? + +--- + +## What I Found + +### Finding 1: DTAC V2 April 6 Deadline Is Administrative — Less Consequential Than Session 11 Framed + +**Correction:** NHS DTAC V2 (published February 24, 2026) is a **form update** (25% fewer questions, de-duplication with DSPT and pre-acquisition questionnaire). The April 6 deadline is the date when the old form must be retired, not a new substantive compliance gate. The clinical safety requirements (DCB0160, DCB0129) are unchanged. + +**What IS the consequential mechanism:** The NHS England AI Scribing Supplier Registry (launched January 16, 2026) with 19 vendors meeting DTAC + MHRA Class 1 requirements. This registry is operational and open for new applications. THAT is the forcing function, not the DTAC V2 form deadline. + +**Key observation:** OpenEvidence is absent from the 19-vendor registry despite OE "Visits" (documentation tool, August 2025) being a direct category competitor. OE's public website contains no DTAC assessment and no MHRA Class 1 registration. OE has signaled 2026 UK expansion targeting UK, Canada, Australia as "English-first markets with lower regulatory barriers" — but this characterization appears to be a strategic misjudgment: NHS requires DTAC + MHRA Class 1 for formal procurement of documentation tools. + +**Practical implication:** OE Visits **cannot be formally deployed in NHS settings** without completing DTAC and MHRA Class 1. Informal use by individual clinicians continues (OE is already being reviewed and discussed in UK clinical contexts), but NHS organizational procurement requires compliance that OE hasn't demonstrated. 
+ +### Finding 2: New Clinical Risk for OE in UK Markets — Corpus Mismatch (Previously Undocumented) + +iatroX Clinical AI Insights (UK-focused clinical AI review) documents a failure mode for OE in UK clinical practice that is **distinct from** the four failure modes documented in Sessions 8-11: + +- OE uses a **US-centric corpus**: cites AHA guidelines rather than NICE guidelines +- May suggest drugs **licensed in the US but not available in the UK** (different BNF formulary) +- Dosing standards and treatment pathways may differ from UK clinical practice +- UK clinicians using OE may receive recommendations that are guideline-adherent for the US but not for the UK + +This is not an LLM failure mode — it's a **data architecture mismatch**. The LLM may be accurate according to US evidence, but wrong for UK clinical practice. Relevant quote: "OE's UK-specific governance (DTAC/DCB) is not explicitly positioned on its public pages." + +**This is a SIXTH distinct clinical AI risk for OE specifically, not just a fifth general LLM failure mode.** The corpus mismatch is potentially more immediately harmful than probabilistic LLM failure modes because it affects ALL recommendations in specific clinical areas (drug prescribing, guideline-concordant treatment). + +### Finding 3: Fifth General LLM Clinical Failure Mode — The Real-World Deployment Gap + +Oxford Internet Institute + Nuffield Dept. of Primary Care, published in *Nature Medicine*, February 2026 (1,298 participants, randomized, preregistered): + +- **LLMs alone:** 94.9% correct condition identification; 56.3% correct disposition +- **Participants using LLMs:** <34.5% correct condition; <44.2% correct disposition — **NO BETTER THAN CONTROL GROUP** +- A 60-percentage-point collapse between LLM isolated performance and user-assisted performance + +Root cause: **"two-way communication breakdown"** — users didn't know what the LLM needed, and responses mixed good and poor recommendations, making it hard to extract the correct action.
+ +**Study conclusion:** "Just as clinical trials are required for medications, AI systems need rigorous testing with diverse, real users." + +**Scope note:** This was PUBLIC use (general population), not physician use like OE. The mechanism may be weaker for trained physicians. But the finding is structural: benchmark performance is NOT a predictor of real-world user-assisted outcomes. The JMIR systematic review of 761 LLM evaluation studies confirms: only 5% used real patient care data; 95% used USMLE-style exam questions. The benchmark-to-reality gap is systematic. + +**Five general LLM clinical failure modes now documented:** +1. Omission-reinforcement (NOHARM: 76.6% of severe errors are omissions) +2. Demographic bias amplification (Nature Medicine, JMIR e78132: systematic bias across care settings) +3. Automation bias robustness (NCT06963957: survives 20-hour training) +4. Medical misinformation propagation (Lancet DH: 32%/47% in clinical language) +5. **Real-world deployment gap (Oxford/Nature Medicine RCT: 60pp performance collapse in user interaction)** + +**Six OE-specific risks (five above + corpus mismatch in non-US markets).** + +### Finding 4: Counter-Evidence — Centaur Model Works Under Specific Conditions + +*Cell Reports Medicine*, October 2025 (PMC12629785), 91 error scenarios across 16 clinical specialties: + +- Pharmacist + LLM co-pilot: **61% accuracy**; **1.5x improvement for serious harm errors vs. pharmacist alone** +- Architecture: RAG (retrieval-augmented generation) from curated drug database — NOT parametric memory + +**This is the best positive clinical AI safety evidence found across 12 sessions.** The centaur design CAN work, but under specific conditions: +1. Domain expert is ENGAGED and in co-pilot mode (not automation bias mode) +2. LLM uses RAG from curated database (reduces hallucination, corpus mismatch, misinformation propagation) +3. 
Task is STRUCTURED (medication safety review — not open-ended clinical reasoning) + +**The conditions matter.** OE doesn't use this architecture: it's a general clinical reasoning tool, not a structured RAG safety checker. But the pharmacist+LLM co-pilot result provides the mechanistic proof that the centaur design can work — it requires design intentionality, not just human oversight. + +### Finding 5: Belief 1 CONFIRMED AND STRENGTHENED — Post-1970 Cohort Mortality Deterioration + +**PNAS 2026** (Abrams & Bramajo et al., UTMB, published March 9-10, 2026): +- Post-1970 cohorts: **increasing mortality in CVD, cancer, AND external causes** vs. predecessors — across ALL three cause groups simultaneously +- **A broad mortality deterioration beginning around 2010** affected **nearly every living adult cohort** — not just younger generations +- Projected: "**unprecedented longer-run stagnation, or even sustained decline**, in US life expectancy" +- Not a single-cause problem: "complex convergence of rising chronic disease, shifting behavioral risks, and increases in certain cancers among younger adults" + +**Context:** CDC reports 2024 US life expectancy reached **79.0 years** (up 0.6 from 78.4 in 2023) — three consecutive years of post-COVID recovery. BUT the PNAS cohort analysis shows this surface improvement is a COVID/overdose recovery, not structural improvement. The cohort trajectory is worsening. + +**The "2010 period effect" is the most significant new finding for Belief 1:** Something systemic changed around 2010 that made EVERY adult cohort simultaneously sicker. This is not a generational behavioral story — it's an environmental/systemic story. The 1950s birth cohort is the transition point from improvement to deterioration. + +**Belief 1 disconfirmation result: FAILED.** The strongest candidate for disconfirmation (CDC's +0.6 year improvement) is surface noise over a deepening structural problem. 
The PNAS analysis provides the most comprehensive multi-cause confirmation of the compounding failure thesis to date. + +### Finding 6: Regulatory Track — Four Mechanisms, Not Three + +Session 11 identified THREE tracks (commercial, research, regulatory). Session 12 identifies **four**: + +**Track 3A — EU AI Act (August 2026, European deployments):** Unchanged from Session 11. OE has made no compliance announcements for European markets. + +**Track 3B — NHS Procurement (UK, operational now):** The supplier registry is the mechanism — 19 vendors compliant, OE absent. UK expansion requires DTAC + MHRA Class 1. This is OE's choice point. + +**Track 4 — UK Parliamentary Scrutiny (March 2026, ongoing):** The House of Lords Science and Technology Committee launched its "Innovation in the NHS: Personalised Medicine and AI" inquiry on March 10, 2026. Written evidence deadline: April 20, 2026. Focus: why does the NHS struggle to adopt innovation, and what's blocking it? This is adoption-focused (the opposite framing from the EU AI Act's safety focus). If the inquiry recommends procurement reform that streamlines AI adoption, it could accelerate OE's NHS path — but OE would still have to complete the governance requirements that streamlining doesn't eliminate. + +### Finding 7: OBBBA Work Requirements — Implementation On Track + +As of January 2026: +- 7 states with pending Section 1115 waivers (Arizona, Arkansas, Iowa, Montana, Ohio, South Carolina, Utah) +- Nebraska implementing via state plan amendment (without waiver) — ahead of the federal mandate +- Federal mandate deadline: December 31, 2026 (with extension to 2028 available) +- Coverage loss effects begin: Q1 2027 + +This confirms Session 8's structural concern: VBC enrollment stability will be disrupted beginning Q1 2027. The BALANCE model's effectiveness under enrollment fragmentation is the key question for 2027.
+ +--- + +## Synthesis + +**The clinical AI safety picture after 12 sessions:** + +The failure mode catalogue is now comprehensive: +- Five general LLM failure modes (vs. three when this thread started in Session 8) +- One OE-specific failure mode in non-US markets (corpus mismatch) +- One counter-evidence case for centaur design (pharmacist+RAG+structured task) +- One fundamental evaluation methodology problem (95% of studies use exam questions, not real patient data) + +The regulatory track has four mechanisms, not three. The NHS supplier registry (operational) and Lords inquiry (adoption-focused) are the UK-specific mechanisms. The EU AI Act remains the largest-scale forcing function (August 2026). None of these mechanisms are yet producing OE safety disclosure. + +**The centaur design insight from Session 12:** The pharmacist+LLM co-pilot result shows the design that would work: RAG architecture, domain expert as engaged co-pilot, structured safety task. OE's design (general clinical reasoning, physician as consumer not co-pilot) is architecturally different from the pharmacist+LLM model. The centaur isn't broken; OE isn't the centaur. + +**Belief 1 after Session 12:** The keystone belief is more structurally grounded than it was before this session. The PNAS 2026 multi-cause cohort analysis is the strongest evidence Vida has encountered for the compounding failure thesis. The 2010 period effect (all cohorts deteriorating simultaneously) opens a new research direction: what systemic factor changed in 2010? 
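The cohort-versus-period distinction the PNAS finding rests on can be made concrete with a toy additive age-period-cohort model. This is an invented illustration (made-up coefficients and a hard step at 2010, not the PNAS estimates); it only shows why "every adult cohort deteriorating simultaneously" identifies a calendar-time (period) effect rather than a birth-year (cohort) effect.

```python
# Toy additive age-period-cohort (APC) model of log mortality. Illustrative only:
# the coefficients and the 2010 step are invented to mirror the PNAS framing.

def log_mortality(age: int, year: int) -> float:
    cohort = year - age                            # birth year
    age_effect = 0.09 * age                        # mortality rises with age
    period_effect = 0.15 if year >= 2010 else 0.0  # shock tied to calendar time
    cohort_effect = 0.01 * max(0, cohort - 1970)   # post-1970 cohorts progressively worse
    return -10.0 + age_effect + period_effect + cohort_effect

# A period effect appears as the SAME jump at every age, in the same calendar years:
period_jump = {age: round(log_mortality(age, 2011) - log_mortality(age, 2009), 3)
               for age in (50, 60, 70)}            # → {50: 0.15, 60: 0.15, 70: 0.15}

# A cohort effect appears as excess mortality tracking BIRTH YEAR at a fixed year:
def cohort_excess(age: int, year: int = 2020) -> float:
    baseline = -10.0 + 0.09 * age + 0.15           # age trend plus the post-2010 period term
    return round(log_mortality(age, year) - baseline, 3)
# cohort_excess(60) == 0.0 (born 1960); cohort_excess(40) == 0.1 (born 1980)
```

A period effect produces the same jump at every age in the same calendar years, while a cohort effect follows birth year at any fixed calendar year; the PNAS result is notable because it reports both signatures at once.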
+ +--- + +## Claim Candidates + +CLAIM CANDIDATE 1: "US life expectancy stagnation is rooted in a post-1970 birth cohort mortality deterioration spanning cardiovascular disease, cancer, and external causes simultaneously — and a period-effect beginning around 2010 that deteriorated every living adult cohort — portending unprecedented longer-run stagnation or sustained decline (PNAS 2026)" +- Domain: health +- Confidence: proven (PNAS peer-reviewed, large n, 1979-2023 data, confirmed by companion PNAS forecast paper) +- Sources: PNAS doi: 10.1073/pnas.2519356123 (March 2026), UTMB newsroom +- KB connections: Strongest structural confirmation of Belief 1 compounding failure thesis; extends deaths-of-despair framing to include CVD and cancer cohort deterioration + +CLAIM CANDIDATE 2: "LLMs achieve 94.9% clinical condition identification accuracy in isolation but participants using the same LLMs perform no better than control groups (<34.5%) — establishing a real-world deployment gap between LLM knowledge and user-assisted outcome improvement that is not predicted by benchmark performance (Nature Medicine RCT, 1,298 participants, Oxford 2026)" +- Domain: health, secondary: ai-alignment +- Confidence: proven (RCT, preregistered, 1,298 participants, three LLMs all showing same gap) +- Sources: Nature Medicine Vol 32 p. 609-615 (February 2026, Oxford) +- KB connections: Fifth distinct clinical AI failure mode; methodologically distinct from automation bias (different mechanism: user fails to extract correct guidance, not physician deferring to wrong guidance); paired with JMIR 95% benchmark evaluation finding + +CLAIM CANDIDATE 3: "Pharmacist + LLM co-pilot using retrieval-augmented generation improves serious medication harm detection by 1.5x vs. 
pharmacist alone across 16 clinical specialties — evidence that the centaur model works under conditions of domain expert engagement, RAG architecture, and structured safety tasks (Cell Reports Medicine, October 2025)" +- Domain: health, secondary: ai-alignment +- Confidence: likely (prospective cross-over, 91 scenarios, 16 specialties, peer-reviewed Cell Press journal; RAG architecture constraint is key scope qualifier) +- Sources: Cell Reports Medicine doi: 10.1016/j.xcrm.2025.00396-9; PMC12629785 +- KB connections: Counter-evidence to the pessimistic reading of Belief 5; establishes design conditions under which centaur succeeds vs. fails; contrasts with automation bias finding (NCT06963957) where centaur fails + +CLAIM CANDIDATE 4: "OpenEvidence's US-centric clinical corpus creates a distinct category of harm in UK clinical practice — guideline mismatch with NICE recommendations, BNF formulary discrepancies, and off-license drug suggestions — independent of LLM failure modes and unaddressed, given OE's lack of any DTAC assessment or MHRA registration as of March 2026" +- Domain: health +- Confidence: proven (guideline corpus mismatch is documented; governance absence is documented fact; iatroX review is independent UK clinical assessment) +- Sources: iatrox.com review series 2025-2026; NHS DTAC guidance; MHRA medical device registration requirements +- KB connections: Sixth OE-specific clinical risk; extends the OE safety opacity thread from Sessions 8-11 into non-US markets; connects to NHS supplier registry absence + +CLAIM CANDIDATE 5: "95% of clinical LLM evaluation studies assessed performance on medical examination questions rather than real patient care data — establishing a systematic evaluation methodology gap that makes USMLE-level benchmark performance uninterpretable as a clinical safety signal (JMIR systematic review, 761 studies, 39 benchmarks)" +- Domain: health, secondary: ai-alignment +- Confidence: proven (systematic review of 761 studies,
peer-reviewed JMIR, PMC12706444) +- Sources: JMIR e84120 (2025); PMC12706444 +- KB connections: Foundational methodology claim for the benchmark-to-reality gap; explains why OE's "100% USMLE" benchmark performance cited in Session 9 is not interpretable as a clinical safety signal; pairs with Oxford/Nature Medicine RCT as the empirical demonstration + +--- + +## Disconfirmation Results + +**Belief 1 (keystone — healthspan as binding constraint): NOT DISCONFIRMED. STRUCTURALLY STRENGTHENED.** +The strongest disconfirmation candidate (CDC 2024 LE recovery to 79.0 years) is surface noise over the structural deterioration documented in the PNAS cohort analysis. The compounding failure thesis is now supported by multi-cause, multi-cohort evidence spanning CVD, cancer, and external causes — not just deaths of despair. + +**Belief 5 (clinical AI safety): NOT DISCONFIRMED. Failure mode catalogue extended to five (general) + one (OE-specific).** +Counter-evidence found (pharmacist+LLM co-pilot, Cell Reports Medicine): centaur design works under RAG+structured+expert-engaged conditions. This is meaningful — the design EXISTS that would work. OE's architecture differs from this design. + +--- + +## Follow-up Directions + +### Active Threads (continue next session) + +- **PNAS "2010 period effect" — what systemic change explains the 2010 deterioration across all cohorts?** This is the most important unexplored question in the Belief 1 thread. ACA passage was 2010; opioid crisis peaked 2015-2016; social media became mass-market 2009-2012. Multiple candidate mechanisms. A targeted search for research on "what changed in 2010 in US mortality" could yield a new structural claim. + +- **EU AI Act August 2026 — OE European compliance status:** Unchanged from Session 11. The five-month clock is now down to ~4.5 months. Watch for: any OE press release mentioning EU compliance, any European health system partnership that would trigger Annex III obligations. 
+ +- **Lords inquiry evidence submissions:** Written evidence deadline is April 20, 2026 — 27 days away. The submissions from NHS trusts, clinical AI companies, and researchers will be published on the Parliament website. This is potentially the richest multi-voice clinical AI governance document of 2026. Watch for OE's submission (if filed) or NHS trust perspectives on clinical AI safety barriers. + +- **NCT07328815 (ensemble LLM confidence signals behavioral nudge trial):** Still no results. Continue watching. + +- **OE UK expansion actual timeline:** The 2026 signal is there but no concrete UK product announcement. Watch for: (a) DTAC assessment filing by OE, (b) MHRA Class 1 registration by OE, (c) OE Visits being offered to NHS trusts. + +### Dead Ends (don't re-run) + +- **Tweet feeds:** Confirmed dead. Don't check. +- **OE-specific demographic bias evaluation:** Confirmed dead in Session 11. Don't re-run. +- **Big Tech GLP-1 adherence native platform:** Confirmed dead across Sessions 9-12. Don't re-run. +- **DTAC V2 April 6 as major compliance gate:** Confirmed this session that it's a form update, not a new substantive requirement. Don't re-frame this as a forcing function. +- **Canada semaglutide generics data:** Health Canada rejection (Dr. Reddy's) confirmed in Session 10. 2027 at earliest. + +### Branching Points + +- **2010 mortality deterioration — behavioral vs. structural cause:** + - Direction A: The 2010 period effect is primarily driven by opioid crisis and deaths of despair (behavioral) — which are beginning to stabilize as overdose deaths plateau. Implications: the period effect may be transient, and the Belief 1 compounding failure framing is stronger for the cohort effect (permanent) than the period effect (potentially reversing). 
+ - Direction B: The 2010 period effect is systemic (ACA insurance disruption, great recession sequelae, metabolic disease epidemic acceleration, social isolation amplified by smartphone/social media) — structural rather than behavioral. Implications: the period effect continues and compounds with the cohort effect, accelerating projected decline. + - **Recommendation: Direction B seems more consistent with the multi-cause finding (CVD AND cancer AND external causes all deteriorating — not just overdose). A behavioral drug crisis would show up primarily in external causes; CVD and cancer deteriorating together suggests metabolic/systemic drivers.** + +- **Lords inquiry impact — adoption vs. safety framing race in UK:** + - Direction A: The Lords inquiry focuses on adoption blockage and produces recommendations that streamline NHS AI procurement. Clinical AI adoption accelerates but safety requirements remain minimal (DTAC is the floor). Safety concerns documented in research continue to diverge from commercial deployment. + - Direction B: Evidence submissions to the Lords inquiry surface the clinical AI safety literature (NOHARM, Oxford RCT, Nature Medicine bias studies) and the inquiry expands its mandate to include safety governance recommendations. This would be the most consequential UK regulatory event for clinical AI safety since the NHS began digitizing. + - **Recommendation: Direction A is more likely given the inquiry's explicit framing ("why aren't we adopting faster?"). Direction B requires a compelling evidence submission that re-frames adoption failure as a safety feature, not a bug. 
Watch evidence submissions carefully.** diff --git a/agents/vida/research-journal.md b/agents/vida/research-journal.md index 0311e34e..45fc263e 100644 --- a/agents/vida/research-journal.md +++ b/agents/vida/research-journal.md @@ -1,5 +1,27 @@ # Vida Research Journal +## Session 2026-03-24 — Keystone Belief Confirmed by PNAS Cohort Study; Fifth Clinical AI Failure Mode; Regulatory Track Clarified + +**Question:** Are clinical AI companies preparing for NHS DTAC V2 (April 6) and EU AI Act (August 2026) compliance — and does this represent the first observable closing of the commercial-research gap? Secondary: does new 2026 evidence challenge Belief 1 (healthspan as binding constraint)? + +**Belief targeted:** Dual focus. Belief 1 (keystone): disconfirmation attempt targeting the CDC's 2024 LE recovery as potential counter-evidence to the compounding failure thesis. Belief 5 (clinical AI safety): regulatory compliance behavior as potential gap-closer; Cell Reports Medicine centaur evidence as counter-evidence to pessimistic reading. + +**Disconfirmation result:** +- **Belief 1: NOT DISCONFIRMED — STRUCTURALLY STRENGTHENED.** PNAS 2026 (Abrams & Bramajo, UTMB, March 9-10) provides the most comprehensive structural confirmation of the compounding failure thesis to date: post-1970 cohorts show increasing mortality from CVD, cancer, AND external causes simultaneously. A period-effect beginning around 2010 deteriorated every living adult cohort. CDC 2024 LE recovery to 79.0 (up 0.6 years) is surface noise over structural deterioration. "Unprecedented longer-run stagnation or sustained decline" projected. +- **Belief 5: NOT DISCONFIRMED — Failure mode catalogue extended to five.** Oxford/Nature Medicine RCT (1,298 participants, preregistered): LLMs achieve 94.9% condition accuracy in isolation but <34.5% in user interaction — NO better than control. 60pp deployment gap is the fifth distinct failure mode (vs. four from Sessions 8-11). 
Counter-evidence: Cell Reports Medicine pharmacist+LLM co-pilot (1.5x improvement for serious harm errors) shows centaur works under RAG+structured+expert-engaged conditions. OE's design doesn't match these conditions. + +**Key finding:** DTAC V2 April 6 deadline is less consequential than Session 11 framed — it's a form update (25% fewer questions), NOT a new compliance gate. The real UK regulatory forcing mechanism is the NHS AI scribing supplier registry (19 vendors operational since January 16, 2026). OE is absent from registry despite "Visits" being a direct category competitor. New OE-specific UK risk identified: US-centric corpus creates NICE/BNF guideline mismatch and off-license drug suggestions — a sixth risk category distinct from LLM failure modes. UK House of Lords launched "Innovation in NHS: Personalised Medicine and AI" inquiry (March 10, 2026) — adoption-focused, evidence deadline April 20. Four regulatory/policy tracks now active, none yet producing OE safety disclosure. + +**Pattern update:** The structural pattern (compounding failure, theory-practice gap, commercial-research divergence) is now confirmed across 12 sessions with increasingly granular evidence. Session 12 adds two dimensions: (1) the "2010 period effect" — something systemic changed around 2010 deteriorating every adult cohort simultaneously, suggesting an environmental/systemic cause beyond behavioral cohort effects; (2) the centaur design that works (RAG+structured+expert co-pilot) vs. OE's architecture (general reasoning, physician as consumer). The gap is not that centaur design is impossible — it's that the commercial product doesn't implement it. + +**Confidence shift:** +- Belief 1 (healthspan as binding constraint): **SIGNIFICANT STRENGTHENING** — PNAS 2026 multi-cause, multi-cohort analysis is the strongest structural confirmation in 12 sessions. The compounding failure thesis extends beyond deaths of despair to include CVD and cancer deterioration in post-1970 cohorts. 
+- Belief 5 (clinical AI safety): **FIFTH FAILURE MODE ADDED** (real-world deployment gap, Oxford Nature Medicine 2026). **CENTAUR DESIGN PARTIALLY VINDICATED** under specific conditions (RAG+structured+expert co-pilot). Net: the safety concern remains but the design solution is more concrete than before. +- Session 11 "DTAC V2 as major regulatory event": **CORRECTED** — form update, not new compliance gate. The supplier registry is the actual mechanism. +- OE UK expansion: **NEW RISK IDENTIFIED** — corpus mismatch adds a sixth clinical risk category for non-US markets, distinct from LLM failure modes. OE's "lower regulatory barriers" characterization of UK market appears inaccurate. + +--- + ## Session 2026-03-23 — OE Model Opacity, Multi-Agent Market Entry, and the Commercial-Research-Regulatory Trifurcation **Question:** Has OpenEvidence been specifically evaluated for the sociodemographic biases documented across all LLMs in Nature Medicine 2025 — and are multi-agent clinical AI architectures (NOHARM's proposed harm-reduction approach) entering the clinical market as a safety design? diff --git a/inbox/queue/2025-04-01-jmir-glp1-digital-engagement-outcomes-retrospective.md b/inbox/queue/2025-04-01-jmir-glp1-digital-engagement-outcomes-retrospective.md new file mode 100644 index 00000000..1bd41e6e --- /dev/null +++ b/inbox/queue/2025-04-01-jmir-glp1-digital-engagement-outcomes-retrospective.md @@ -0,0 +1,61 @@ +--- +type: source +title: "JMIR 2025: Digital Engagement Enhances GLP-1 Weight Loss Outcomes — 11.53% vs. 8% at Month 5 (Engaged vs. Non-Engaged)" +author: "Johnson et al. 
(Diabetes, Obesity and Metabolism / JMIR)" +url: https://www.jmir.org/2025/1/e69466 +date: 2025-04-01 +domain: health +secondary_domains: [] +format: research-paper +status: unprocessed +priority: medium +tags: [glp1, semaglutide, digital-health, behavioral-support, adherence, weight-loss, atoms-to-bits, belief-4, real-world-data] +--- + +## Content + +Published in *Journal of Medical Internet Research* (JMIR), 2025, e69466. Also published in *Diabetes, Obesity and Metabolism* (Wiley, doi: 10.1111/dom.70244) as "Digital engagement enhances dual GIP/GLP-1 receptor agonist and GLP-1 receptor agonist efficacy." + +PMC archive: PMC11997532. + +**Study design:** Retrospective cohort service evaluation of a digital weight management platform integrated with GLP-1 therapy (both semaglutide and tirzepatide). Compares engaged vs. non-engaged participants. + +**Key findings:** +- At month 5: **Engaged participants: 11.53% mean weight loss** vs. **non-engaged: 8%** — a 3.5 percentage point advantage from digital engagement +- Digital platform: live group video coaching, text-based in-app support, dynamic educational content, real-time weight monitoring, medication adherence tracking +- Real-world data: "roughly half of users stopping within a year" but persistence improves to 63% when supply and coverage issues addressed + +**Related finding (Danish study, previously documented):** +- Online weight-loss program + semaglutide at half typical dose → 16.7% weight loss over 64 weeks +- Equivalent outcomes at half the drug dose with behavioral support + +**2026 context:** +- Oral semaglutide FDA-approved for weight management (2026) — may improve adherence via non-injection route +- "2026 is the year GLP-1s grow up" (MM+M) — shift from prescription volume to outcomes metrics and adherence management + +## Agent Notes + +**Why this matters:** This is US real-world data (not Danish controlled study) confirming the digital engagement effect on GLP-1 outcomes. The 11.53% vs. 
8% difference (3.5pp advantage) is clinically meaningful — equivalent to one additional dose level in many GLP-1 titration protocols. Under capitated payment models (VBC), this difference could determine whether GLP-1s are cost-saving or cost-additive for a population. + +**What surprised me:** The study covers BOTH semaglutide and tirzepatide, showing the digital engagement effect generalizes across the GLP-1/GIP class. This isn't just a semaglutide story; behavioral support amplifies both molecules. + +**What I expected but didn't find:** Evidence that specific behavioral support components (coaching vs. monitoring vs. education) drive the effect differentially. The study doesn't disambiguate which platform element drives the 3.5pp advantage. The Danish study's insight (half-dose = equivalent outcomes) was more mechanistically useful. + +**KB connections:** +- Extends and confirms the Danish study finding (previously documented in Session 4) with US real-world data +- Strengthens Belief 4 (atoms-to-bits) — behavioral/digital support ("bits") amplifies GLP-1 efficacy ("atoms"), confirming the defensible value layer thesis +- Connects to the GLP-1 adherence paradox (Session 3): MA plans restrict access despite downstream savings; this data shows the magnitude of lost savings from non-engagement +- The 63% persistence when supply/coverage issues resolved → the access barrier (OBBBA Medicaid cuts) is a direct threat to realizing these outcomes at population scale +- Oral semaglutide FDA approval for weight management (2026) = potential adherence improvement; this is a new data point not in prior sessions + +**Extraction hints:** +- This is a confirmation of the Session 4/5 Danish study finding — update existing claim with US real-world corroboration +- New claim candidate: "Oral semaglutide's 2026 FDA approval for weight management may reduce the adherence gap that makes GLP-1 economics fragile under capitation, by eliminating injection barriers for self-pay and 
telehealth populations" +- The atoms-to-bits framing: "Digital engagement produces 3.5pp additional weight loss vs. GLP-1 alone in real-world US populations — the 'bits' layer amplifies the 'atoms' layer, making behavioral platform integration the value driver in a commoditizing drug market" + +**Context:** JMIR is a high-volume digital health journal; the Diabetes, Obesity and Metabolism (Wiley) publication gives it endocrinology/obesity journal credibility. Retrospective cohort design (not RCT) — selection bias possible (engaged users may be more motivated), but real-world operational data. + +## Curator Notes +PRIMARY CONNECTION: Belief 4 atoms-to-bits + Session 4/5 GLP-1 adherence thread +WHY ARCHIVED: US real-world confirmation of Danish study finding; adds data point for oral semaglutide FDA approval as a potential adherence game-changer +EXTRACTION HINT: Update existing GLP-1 adherence claim with US real-world data; create new claim for oral semaglutide adherence pathway if not already in KB diff --git a/inbox/queue/2025-10-15-cell-reports-medicine-llm-pharmacist-copilot-medication-safety.md b/inbox/queue/2025-10-15-cell-reports-medicine-llm-pharmacist-copilot-medication-safety.md new file mode 100644 index 00000000..af901063 --- /dev/null +++ b/inbox/queue/2025-10-15-cell-reports-medicine-llm-pharmacist-copilot-medication-safety.md @@ -0,0 +1,57 @@ +--- +type: source +title: "Cell Reports Medicine 2025: Pharmacist + LLM Co-pilot Outperforms Pharmacist Alone by 1.5x for Serious Medication Errors" +author: "Multiple authors (Cell Reports Medicine, cross-institutional)" +url: https://pmc.ncbi.nlm.nih.gov/articles/PMC12629785/ +date: 2025-10-15 +domain: health +secondary_domains: [ai-alignment] +format: research-paper +status: unprocessed +priority: medium +tags: [clinical-ai-safety, centaur-model, medication-safety, llm-copilot, pharmacist, clinical-decision-support, rag, belief-5-counter-evidence] +--- + +## Content + +Published in *Cell Reports Medicine*, 
October 2025 (doi: 10.1016/j.xcrm.2025.00396-9). Prospective, cross-over study. Published in PMC as PMC12629785. + +**Study design:** +- 91 error scenarios based on 40 clinical vignettes across **16 medical and surgical specialties** +- LLM-based clinical decision support system (CDSS) using retrieval-augmented generation (RAG) framework +- Three arms: (1) LLM-based CDSS alone, (2) Pharmacist + LLM co-pilot, (3) Pharmacist alone +- Outcome: accuracy in identifying medication safety errors + +**Key findings:** +- **Pharmacist + LLM co-pilot:** 61% accuracy (precision 0.57, recall 0.61, F1 0.59) +- **Serious harm errors:** Co-pilot mode increased accuracy by **1.5-fold over pharmacist alone** +- Conclusion: "Effective LLM integration for complex tasks like medication chart reviews can enhance healthcare professional performance, improving patient safety" + +**Implementation note:** This used a RAG architecture (retrieval-augmented generation), meaning the LLM retrieved drug information from a curated database rather than relying solely on parametric memory — reducing hallucination risk. + +## Agent Notes + +**Why this matters:** This is the clearest counter-evidence to Belief 5's pessimistic reading in the KB. Where NOHARM shows 22% severe error rates and the Oxford RCT shows zero improvement over controls, this study shows a POSITIVE centaur outcome: pharmacist + LLM outperforms pharmacist alone by 1.5x on the outcomes that matter most (serious harm errors). This is the centaur model working as intended. + +**What surprised me:** The 1.5x improvement on serious harm specifically — not just average accuracy. This means the LLM helps most where the stakes are highest. That's the ideal safety profile: catching the worst errors. The RAG architecture may be key — this isn't a general chat LLM but a structured decision support tool with constrained information retrieval. + +**What I expected but didn't find:** A clear statement of failure conditions. 
When does the co-pilot model FAIL to improve? The 61% accuracy ceiling suggests the co-pilot mode also misses ~39% of errors. The study doesn't clearly delineate what the LLM adds vs. what it misses. + +**KB connections:** +- Counter-evidence to Sessions 8-11 clinical AI safety concern: the centaur model CAN work in specific conditions (RAG architecture, domain-expert+LLM combination, structured safety task) +- The centaur design requires domain expert + LLM — this is specifically a pharmacist co-pilot, not a physician being replaced +- Connects to NOHARM: NOHARM found 76.6% of severe errors are omissions. If the pharmacist+LLM catches errors the pharmacist alone misses, the omission-detection mechanism is real — but requires the pharmacist to be present and engaged (not automation bias mode) +- The RAG architecture is important: this isn't vulnerable to the misinformation propagation failure mode (Lancet DH 2026) the way a general LLM is, because it retrieves from a curated database +- Connects to the distinction between "clinical reasoning AI" (OE) and "structured CDSS with RAG" (this study) — these are different products with different safety profiles + +**Extraction hints:** +- Primary claim: "LLM-based clinical decision support in co-pilot mode with a domain expert improves serious medication harm detection by 1.5x vs. pharmacist alone — evidence that centaur design works for structured safety tasks using RAG architecture" +- The constraint is important: centaur works when (a) the expert is engaged (not automation bias mode), (b) the LLM uses RAG (not parametric memory), (c) the task is structured (medication safety, 16 specialties) +- This limits the claim — it does NOT say "clinical AI is safe in general" — it says "LLM + expert in a structured RAG setting improves safety for a defined task" + +**Context:** Cell Reports Medicine is a high-tier Cell Press journal for clinical translational research. Prospective cross-over design with clear comparison arms. 
16 specialties gives the finding breadth across clinical contexts. + +## Curator Notes +PRIMARY CONNECTION: Belief 5 counter-evidence — centaur model works under specific conditions +WHY ARCHIVED: Best positive clinical AI safety evidence found across 12 sessions; establishes the conditions under which centaur design improves outcomes +EXTRACTION HINT: Extract with explicit scope constraint: centaur + RAG + structured safety task = works; general CDSS + automation bias mode = doesn't work per other evidence diff --git a/inbox/queue/2025-11-01-jmir-knowledge-practice-gap-39-benchmarks-systematic-review.md b/inbox/queue/2025-11-01-jmir-knowledge-practice-gap-39-benchmarks-systematic-review.md new file mode 100644 index 00000000..26914561 --- /dev/null +++ b/inbox/queue/2025-11-01-jmir-knowledge-practice-gap-39-benchmarks-systematic-review.md @@ -0,0 +1,55 @@ +--- +type: source +title: "JMIR 2025 Systematic Review: Knowledge-Practice Performance Gap in Clinical LLMs — Only 5% of 761 Studies Used Real Patient Data" +author: "JMIR authors (systematic review team)" +url: https://www.jmir.org/2025/1/e84120 +date: 2025-11-01 +domain: health +secondary_domains: [ai-alignment] +format: research-paper +status: unprocessed +priority: medium +tags: [clinical-ai-safety, benchmark-performance-gap, llm-evaluation, knowledge-practice-gap, real-world-deployment, belief-5, systematic-review] +--- + +## Content + +Published in *Journal of Medical Internet Research* (JMIR), 2025, Vol. 2025, e84120. Available in PMC as PMC12706444. Systematic review of 761 LLM evaluation studies across clinical medicine, analyzing 39 benchmarks. 
+ +**Key findings:** +- **Only 5%** of 761 LLM evaluation studies assessed performance on real patient care data +- Remaining 95%: relied on medical examination questions (USMLE-style) or case vignettes +- Traditional knowledge-based benchmarks show saturation: leading models achieve 84-90% accuracy on USMLE +- **Conversational frameworks:** Diagnostic accuracy drops from 82% on traditional case vignettes to 62.7% on multi-turn patient dialogues — **a 19.3 percentage point decrease** +- LLMs demonstrate "markedly lower performance on script concordance testing (evaluating clinical reasoning) than on medical multiple-choice benchmarks" +- Review conclusion: "Recent audits reveal substantial disconnects from clinical reality and foundational gaps in construct validity, data integrity, and safety coverage" + +**Related findings (npj Digital Medicine benchmark study):** +- Six LLMs evaluated: average total score 57.2%, safety score 54.7%, effectiveness 62.3% +- **13.3% performance drop in high-risk scenarios** vs. average scenarios + +## Agent Notes + +**Why this matters:** This is the methodological foundation under both the Oxford/Nature Medicine RCT (94.9% → 34.5% deployment gap) and the broader claim that OE's USMLE 100% benchmark performance doesn't predict clinical outcomes. The systematic review establishes that the benchmark-to-reality gap is systematic across the field, not anomalous. The 5% real-patient-data figure is particularly striking: 95% of clinical AI evaluation is done with questions that would never fool a medical student, not with actual clinical workflows. + +**What surprised me:** The 19.3 percentage point drop from case vignettes to multi-turn dialogues. This is the conversational complexity gap — the same model that answers discrete questions well fails in the back-and-forth of real clinical interaction. OE users query OE in conversational clinical language, making this gap directly relevant. 
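The headline figures are internally consistent; a minimal sanity check (assuming the 5% share applies to the full 761-study corpus — the review reports the share, not the count):

```python
# Recompute the review's headline arithmetic from the figures reported above.
total_studies = 761
real_data_share = 0.05        # share of studies evaluated on real patient care data
vignette_acc = 82.0           # % diagnostic accuracy on traditional case vignettes
dialogue_acc = 62.7           # % diagnostic accuracy on multi-turn patient dialogues

real_data_studies = round(total_studies * real_data_share)
gap_pp = round(vignette_acc - dialogue_acc, 1)

print(real_data_studies)  # roughly 38 studies with real patient care data
print(gap_pp)             # 19.3 percentage points, matching the reported drop
```

On this reading, only about 38 of 761 studies touch real clinical data — the concrete denominator behind the 95% figure.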
+ +**What I expected but didn't find:** Any indication that the field is systematically correcting this — moving toward real-patient-data evaluation. The review documents the problem but doesn't identify a trend toward better evaluation practices. + +**KB connections:** +- Methodological foundation for the Oxford/Nature Medicine RCT deployment gap finding +- Directly explains why OE's USMLE 100% benchmark performance (cited in Session 9) doesn't predict clinical safety +- Connects to NOHARM's finding that real clinical scenario evaluation (31 LLMs, complex vignettes) shows 22% severe error rates — vs. USMLE saturation at 84-90% +- The 13.3% performance drop in high-risk scenarios (npj Digital Medicine) maps to NOHARM's finding that omissions cluster in complex, high-acuity scenarios + +**Extraction hints:** +- Primary claim: "95% of clinical LLM evaluation uses medical examination questions rather than real patient care data — a systematic evaluation methodology gap that makes benchmark performance (84-90% USMLE) uninterpretable as a clinical safety signal" +- Secondary: "Conversational frameworks reveal 19.3pp accuracy drop vs. case vignettes, demonstrating that LLMs fail in the back-and-forth interaction that defines actual clinical use" +- This could merge with the Oxford/Nature Medicine source as a unified "benchmark saturation and real-world deployment gap" claim + +**Context:** JMIR is a leading peer-reviewed journal in digital health and health informatics. Systematic review of 761 studies is a large corpus. The PMC availability confirms peer review. 
+ +## Curator Notes +PRIMARY CONNECTION: Belief 5 — clinical AI safety evaluation methodology gap +WHY ARCHIVED: Provides systematic evidence that the KB's reliance on benchmark performance data (e.g., "OE scores 100% on USMLE") is epistemically weak — and establishes that the Oxford RCT deployment gap finding is part of a systematic pattern +EXTRACTION HINT: Extract the 5%/95% finding as a standalone methodological claim about the clinical AI evaluation field; pair with Oxford Nature Medicine RCT as empirical confirmation diff --git a/inbox/queue/2026-01-16-nhs-england-ai-scribing-supplier-registry-19-vendors.md b/inbox/queue/2026-01-16-nhs-england-ai-scribing-supplier-registry-19-vendors.md new file mode 100644 index 00000000..ae0d9af9 --- /dev/null +++ b/inbox/queue/2026-01-16-nhs-england-ai-scribing-supplier-registry-19-vendors.md @@ -0,0 +1,62 @@ +--- +type: source +title: "NHS England AI Scribing Supplier Registry (January 2026): 19 Vendors, DTAC + MHRA Class 1 Required — OpenEvidence Absent" +author: "NHS England / Digital Health Network" +url: https://www.digitalhealth.net/2026/01/nhs-england-launches-supplier-registry-for-ai-scribing-tech/ +date: 2026-01-16 +domain: health +secondary_domains: [ai-alignment] +format: news +status: unprocessed +priority: high +tags: [nhs-dtac, clinical-ai-safety, regulatory-compliance, openevidence, ambient-scribing, mhra, supplier-registry, uk-healthcare, belief-5] +--- + +## Content + +NHS England published a self-certified supplier registry for AI-enabled ambient scribing (Ambient Voice Technology, AVT) on January 16, 2026. The registry was announced in early 2025 and launched following an open application process. 
+ +**Registry requirements for suppliers:** +- Completion of NHS DTAC (Digital Technology Assessment Criteria) assessment +- MHRA Class 1 Medical Device registration with evidence of post-market surveillance +- Proven impact and experience in healthcare environments +- Integration with existing NHS digital infrastructure +- Scalability +- Evidence of meeting stated clinical capabilities + +**The 19 registered vendors (as of January 2026):** +33n, Accurx, Anathem, Aprobrium (Lexacom), Beam Up, Corti, Dictate IT, eConsult, HealthOrbit AI, Heidi Health, Lyrebird Health, Microsoft Dragon, Optum (EMIS), Pungo t/a Joy, Scribetech, Tandem, Tortus, T-Pro, X-On Health. + +**Applications reopened February 3, 2026, and remain open indefinitely.** + +**NHS DTAC V2 update (February 24, 2026):** NHS England published an updated DTAC form with 25% fewer questions, de-duplicated with DSPT and pre-acquisition questionnaire. Deadline: ALL NHS digital health tool procurement must use the new form from April 6, 2026. + +**NHS England April 2025 guidance on AI-enabled ambient scribing:** Mandates full clinical safety case (DCB0160), Data Protection Impact Assessment (DPIA), MHRA medical device determination, DTAC compliance. + +**OpenEvidence "Visits" context:** In August 2025, OE launched "Visits" — a documentation tool that auto-generates clinical notes from patient encounters AND enriches notes with evidence-based guidelines. This is a hybrid documentation+CDSS tool that would need DTAC + MHRA Class 1 to be formally deployed in NHS settings. OE is **not on the 19-vendor registry.** OE's public website contains **no DTAC assessment and no MHRA registration evidence.** + +## Agent Notes + +**Why this matters:** The NHS supplier registry is the regulatory forcing function I hypothesized in Session 11. It's now operational: 19 vendors have met DTAC + MHRA Class 1 requirements. 
OpenEvidence "Visits" (documentation tool launched August 2025) would directly compete with tools on this registry — but OE has not completed the required compliance steps. OE's stated 2026 UK expansion plans require DTAC compliance for any NHS deployment. This creates a choice point for OE: formalize UK compliance (and thereby disclose clinical safety data) or remain UK individual-clinician only (informal use, not NHS-reimbursed). + +**What surprised me:** OE's absence from the registry despite "Visits" being a clear ambient scribing competitor. The 19-vendor registry includes Microsoft Dragon and Accurx (major players) — OE would be a meaningful addition if it were compliance-ready. Its absence suggests either: (a) OE has not prioritized UK compliance, or (b) OE has not completed DTAC assessment, or (c) OE is pursuing UK expansion through a different channel. Option (b) is consistent with all prior findings. + +**What I expected but didn't find:** Any indication that OE has initiated a DTAC assessment or MHRA Class 1 registration process in anticipation of UK expansion. No press release from OE about EU or UK regulatory compliance has been found across 12 sessions. 
+ +**KB connections:** +- Directly relevant to OE model opacity finding (Sessions 8-11): DTAC compliance REQUIRES clinical safety case disclosure — this is the mechanism that could force the transparency the research literature has demanded +- Connects to NHS England's April 2025 ambient scribing guidance (DCB0160/0129) — OE Visits falls within scope +- Extends the regulatory track finding from Session 11 to a more concrete level: 19 vendors already complied; OE has not +- The DTAC V2 April 6 deadline (13 days from today) codifies the new form but doesn't create new substantive requirements — it's a procedural update + +**Extraction hints:** +- Primary claim: "NHS England's January 2026 AI scribing supplier registry established DTAC completion and MHRA Class 1 registration as compliance requirements for clinical AI documentation tools in NHS settings — OpenEvidence 'Visits' is absent despite being a direct category competitor" +- Secondary claim: "DTAC assessment requires clinical safety case (DCB0160) disclosure — making NHS deployment an indirect forcing function for clinical AI safety transparency that market incentives have not produced" +- This is the UK regulatory equivalent of the EU AI Act (August 2026) for documentation tools specifically + +**Context:** NHS England is the executive body of the NHS in England, responsible for overseeing and commissioning health services. DTAC is its baseline digital governance standard. MHRA (Medicines and Healthcare products Regulatory Authority) is the UK equivalent of FDA for medical devices. 
+ +## Curator Notes +PRIMARY CONNECTION: Session 11 regulatory track finding — NHS DTAC compliance is an observable forcing function +WHY ARCHIVED: Provides concrete evidence that the NHS regulatory compliance mechanism is operational (19 vendors), and that OE is choosing not to comply despite clear competitive incentive +EXTRACTION HINT: Focus on OE's conspicuous absence from registry + what DTAC compliance would require (clinical safety disclosure) — this is the structural gap claim diff --git a/inbox/queue/2026-01-23-obbba-medicaid-work-requirements-implementation-2026-states.md b/inbox/queue/2026-01-23-obbba-medicaid-work-requirements-implementation-2026-states.md new file mode 100644 index 00000000..6c6c928b --- /dev/null +++ b/inbox/queue/2026-01-23-obbba-medicaid-work-requirements-implementation-2026-states.md @@ -0,0 +1,58 @@ +--- +type: source +title: "OBBBA Medicaid Work Requirements: 7 States With Pending Waivers, December 2026 Federal Mandate Deadline" +author: "Ballotpedia News / Georgetown CCF / NASHP / AMA" +url: https://news.ballotpedia.org/2026/01/23/mandatory-medicaid-work-requirements-are-coming-what-do-they-look-like-now/ +date: 2026-01-23 +domain: health +secondary_domains: [] +format: news +status: unprocessed +priority: medium +tags: [obbba, medicaid, work-requirements, vbc, belief-3, structural-misalignment, enrollment-stability, vbc-attractor-state, state-policy] +--- + +## Content + +As of January 23, 2026, implementation progress on OBBBA's Medicaid work requirements: + +**Federal mandate:** All states must implement work requirements by **December 31, 2026**. States that need more time can request HHS extension to 2028. + +**Work requirement terms:** Ages 19-64 must work or participate in qualifying activities ≥80 hours/month to maintain eligibility. Exemptions: parents of children ≤13, medically frail, and others. 
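The eligibility rule described above can be sketched as a minimal predicate. This is an illustration of the stated terms only — the function and parameter names are mine, and real state implementations add many more exemption categories plus verification and reporting mechanics:

```python
# Illustrative sketch of the OBBBA work-requirement rule as summarized above.
# All names are hypothetical; actual exemption logic is far more extensive.
def keeps_medicaid_eligibility(age, qualifying_hours_per_month,
                               youngest_child_age=None, medically_frail=False):
    if not 19 <= age <= 64:
        return True   # outside the mandate's age band
    if medically_frail:
        return True   # exempt
    if youngest_child_age is not None and youngest_child_age <= 13:
        return True   # parent of a child aged 13 or under: exempt
    return qualifying_hours_per_month >= 80  # work/qualifying-activity threshold

print(keeps_medicaid_eligibility(35, 85))  # True: meets the 80-hour threshold
print(keeps_medicaid_eligibility(35, 60))  # False: under threshold, no exemption
```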
+ +**State-level progress (as of Jan 2026):** +- **7 states with pending Section 1115 waivers:** Arizona, Arkansas, Iowa, Montana, Ohio, South Carolina, Utah. All still pending at CMS as of January 23. +- **Nebraska:** Implementing via state plan amendment (without waiver), ahead of federal mandate. +- **Early implementation states** can proceed immediately; others have until December 31, 2026, or 2028 with extension. + +**Federal funding:** $200M for HHS implementation, $200M for states in FY2026. Required state outreach to beneficiaries: June–August 2026. + +**Scale context:** CBO projected 5.3M people losing Medicaid coverage; implementation timeline confirms this affects 2027 coverage losses (January 1, 2027 mandatory start date was confirmed in Session 8 analysis). + +Supporting sources: Georgetown Center for Children and Families (CCF) analysis of how OBBBA changed the waiver landscape (July 2025); NASHP state-level policy update; AMA changes to Medicaid and ACA overview; King & Spalding detailed healthcare industry review. + +## Agent Notes + +**Why this matters:** The work requirements implementation timeline is on track for the disruption to VBC enrollment stability that Session 8 identified as the primary mechanism by which OBBBA threatens the attractor state thesis. The December 2026 deadline means observable effects will begin January 2027. The 7-state waiver pipeline shows early-mover states are actively pursuing implementation — this is not administrative stall. + +**What surprised me:** The Nebraska precedent — implementing without a waiver via state plan amendment. This suggests states don't even need CMS waiver approval to proceed; they can use a state plan amendment if the OBBBA statutory requirement is self-executing. This accelerates the timeline. + +**What I expected but didn't find:** Any substantial state-level resistance or legal challenges blocking implementation. 
The OBBBA work requirements appear to be proceeding through regulatory channels without the court injunctions that blocked the earlier waiver-based work requirements (e.g., Arkansas, 2018–2019). The political landscape has shifted.
+
+**KB connections:**
+- Directly extends Session 8 finding on OBBBA + VBC enrollment stability (Belief 3)
+- The December 2026 deadline means VBC plan enrollment disruption begins Q1 2027 — this is the window to watch for BALANCE model implementation being tested against enrollment fragmentation
+- Connects to OBBBA's 5.3M coverage loss (CBO) — these are disproportionately working-age adults with chronic conditions, exactly the population VBC risk-bearing plans need for prevention economics
+- The June–August 2026 required state outreach is a potential signal point: if states fail to effectively notify beneficiaries, coverage loss will exceed CBO estimates
+
+**Extraction hints:**
+- This is an implementation status update for the Session 8 OBBBA claim — update the existing claim with: "seven states have pending waivers, Nebraska proceeding without waiver, December 2026 mandatory deadline confirmed"
+- Candidate claim wording (if a new claim were warranted): "OBBBA Medicaid work requirements are on track for December 2026 implementation with 7 states seeking early waivers and Nebraska proceeding via state plan amendment — enrollment disruption for VBC prevention economics begins Q1 2027"
+- On balance, don't create a new claim; update the existing OBBBA source with this timeline confirmation
+
+**Context:** Ballotpedia News provides nonpartisan tracking of state/federal policy; Georgetown CCF is the leading Medicaid policy research center. AMA and NASHP provide clinical/public health perspective. Cross-source consistency confirms the timeline.
+ +## Curator Notes +PRIMARY CONNECTION: Belief 3 "structural misalignment" + OBBBA enrollment stability mechanism from Session 8 +WHY ARCHIVED: Implementation update confirming that the December 2026 OBBBA enrollment disruption is on track — the KB needs to update confidence from "projected" to "in-progress" +EXTRACTION HINT: Update the existing OBBBA claim rather than creating a new one; the observation period is Q1 2027 when work requirements take full effect diff --git a/inbox/queue/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.md b/inbox/queue/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.md new file mode 100644 index 00000000..be204d53 --- /dev/null +++ b/inbox/queue/2026-02-10-oxford-nature-medicine-llm-public-medical-advice-rct.md @@ -0,0 +1,59 @@ +--- +type: source +title: "Nature Medicine 2026: LLM Clinical Knowledge Does Not Translate to User Interactions — RCT With 1,298 Participants" +author: "Oxford Internet Institute & Nuffield Dept of Primary Care (University of Oxford, MLCommons et al.)" +url: https://www.nature.com/articles/s41591-025-04074-y +date: 2026-02-10 +domain: health +secondary_domains: [ai-alignment] +format: research-paper +status: unprocessed +priority: high +tags: [clinical-ai-safety, llm-medical-advice, real-world-deployment, benchmark-performance-gap, automation-bias, public-health-ai, belief-5, oxford] +flagged_for_theseus: ["Real-world deployment gap between LLM benchmark performance and user interaction outcomes — AI safety/alignment implication beyond healthcare"] +--- + +## Content + +Published in *Nature Medicine*, February 2026 (Vol. 32, p. 609–615). Lead institution: Oxford Internet Institute and Nuffield Department of Primary Care Health Sciences, University of Oxford. Randomized, preregistered study with 1,298 participants. 
+ +**Study design:** Participants were randomly assigned to use an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control) to navigate 10 medical scenarios. Measured: correct condition identification and appropriate disposition (e.g., seek emergency care vs. wait-and-see). + +**Key findings:** +- **LLMs tested alone:** Correctly identified conditions in **94.9%** of cases; correct disposition in **56.3%** on average (state-of-the-art benchmark performance). +- **Participants using LLMs:** Identified relevant conditions in **fewer than 34.5%** of cases; disposition in **fewer than 44.2%** — **NO BETTER THAN CONTROL GROUP** using traditional methods (online search, own judgment). +- The gap: 94.9% → 34.5% condition accuracy (a 60-percentage-point collapse) in real user interaction. +- Root cause: **"Two-way communication breakdown"** — users didn't know what information the LLMs needed; LLM responses frequently mixed good and poor recommendations, making it difficult to identify correct action. +- Study conclusion: "Current evaluation methods do not reflect the complexity of interacting with human users." +- Key call: "Just as clinical trials are required for medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities." + +Press coverage: University of Oxford newsroom (Feb 10), The Register ("AI chatbots don't improve medical advice, study finds"), NIHR Oxford BRC. + +**Important scope note:** This study evaluated PUBLIC use (general population navigating medical scenarios) — NOT physician use (like OpenEvidence). But the underlying mechanism (communication breakdown, mixed-quality response interpretation) is not specific to untrained users. 
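Both reported gaps can be recomputed directly from the figures above (user-arm numbers are the paper's stated upper bounds):

```python
# Benchmark-to-deployment gaps, from the figures reported above.
cond_llm_alone, cond_with_users = 94.9, 34.5   # % correct condition identification
disp_llm_alone, disp_with_users = 56.3, 44.2   # % correct disposition

print(round(cond_llm_alone - cond_with_users, 1))  # 60.4 pp condition gap
print(round(disp_llm_alone - disp_with_users, 1))  # 12.1 pp disposition gap
```

Note the asymmetry: the condition-identification gap (60.4pp) dwarfs the disposition gap (12.1pp), partly because the LLMs' solo disposition accuracy was already modest at 56.3%.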
+ +## Agent Notes + +**Why this matters:** This is a NEW (fifth) clinical AI safety failure mode distinct from the four documented in Sessions 8-11: (1) omission-reinforcement, (2) demographic bias amplification, (3) automation bias robustness, (4) medical misinformation propagation. This fifth mode is the **real-world deployment gap** — LLMs perform well in isolation on benchmarks but this performance does not translate to improved user outcomes in actual interaction. The 60-percentage-point gap between LLM solo performance (94.9%) and user-assisted performance (<34.5%) is structurally important. + +**What surprised me:** The control group performed comparably to the LLM-assisted group. This means LLMs added ZERO measurable benefit over existing information-seeking behavior for the general public in medical scenarios. This is not "LLMs made things worse" (no harm signal) — it's "LLMs failed to improve over what people already do." That's the null result that clinical AI proponents have never wanted to confront directly. + +**What I expected but didn't find:** A nuanced finding that better-designed LLMs (GPT-4o vs. Llama 3) outperformed simpler ones in real-world use. The study used three different LLMs and the result held across all — it's the INTERACTION mode, not the model, that explains the gap. 
+ +**KB connections:** +- Fifth distinct clinical AI safety failure mode: "real-world deployment gap" (benchmark performance does not predict user-assisted outcome improvement) +- Directly relevant to the JMIR 2025 systematic review finding that only 5% of LLM evaluations used real patient care data — this study is part of the ~5% that does +- Connects to OE's USMLE 100% benchmark performance cited in the knowledge base — if OE is tested alone it likely performs at benchmark; but physician interactions with OE may suffer from a similar deployment gap +- Compounds with automation bias finding (NCT06963957): physicians defer to AI even when it's wrong; public users fail to extract correct guidance even when AI knows the right answer. Two different failure modes, both erasing clinical value. +- Connects to the Knowledge-Practice Gap systematic review (JMIR 2025 — 39 benchmarks, only 5% real patient data) + +**Extraction hints:** +- Primary claim: "LLMs achieve 94.9% condition identification accuracy in isolation but participants using the same LLMs perform no better than control groups (<34.5%), establishing a real-world deployment gap between LLM knowledge and user-assisted outcome improvement" +- The deployment gap is a SCOPE issue: OE is physician-facing (not public-facing), so the mechanism may be weaker for OE — but the zero-improvement-over-control result for informed users is still a serious evidentiary challenge to clinical AI value claims +- Flag this for Theseus: the benchmark-to-deployment gap is a general AI safety concern, not just healthcare-specific + +**Context:** Oxford Internet Institute is a leading AI-society research center. MLCommons co-sponsorship adds credibility (they also run HELM benchmarks). Published in Nature Medicine — highest-tier clinical AI venue. Preregistered RCT — highest evidence level. 
+ +## Curator Notes +PRIMARY CONNECTION: Belief 5 "clinical AI augments but creates novel safety risks requiring centaur design" — fifth failure mode documented +WHY ARCHIVED: Establishes the real-world deployment gap as distinct from automation bias; challenges the assumption that high benchmark performance predicts improved clinical outcomes +EXTRACTION HINT: Extract as standalone claim — distinguish from automation bias (different mechanism: there, physician defers to wrong AI; here, user fails to extract correct guidance from right AI) diff --git a/inbox/queue/2026-02-24-nhs-dtac-v2-updated-form-april-6-deadline.md b/inbox/queue/2026-02-24-nhs-dtac-v2-updated-form-april-6-deadline.md new file mode 100644 index 00000000..c1e5095b --- /dev/null +++ b/inbox/queue/2026-02-24-nhs-dtac-v2-updated-form-april-6-deadline.md @@ -0,0 +1,61 @@ +--- +type: source +title: "NHS DTAC V2 (February 2026): Updated Form With 25% Fewer Questions, Mandatory From April 6, 2026" +author: "NHS England / Periculo Cyber / Acorn Compliance" +url: https://www.periculo.co.uk/cyber-security-blog/dtac-version-2-what-digital-health-organisations-need-to-know-before-6th-april-2026 +date: 2026-02-24 +domain: health +secondary_domains: [] +format: news +status: unprocessed +priority: low +tags: [nhs-dtac, regulatory-compliance, digital-health, uk-healthcare, clinical-ai-safety, belief-5] +--- + +## Content + +NHS England published an updated DTAC form on February 24, 2026. 
Key changes: + +**What changed:** +- 25% reduction in questions +- De-duplicated with: DSPT (Data Security and Protection Toolkit) and pre-acquisition questionnaire +- Clearer guidance on DTAC's purpose, scope, and how to complete assessments + +**What DIDN'T change:** +- The five core DTAC domains: Clinical Safety, Data Protection, Technical Security, Interoperability, Usability & Accessibility +- The substantive clinical safety requirements (DCB0129/DCB0160) +- The requirement for all NHS digital health tool procurement to use DTAC assessment + +**Implementation:** +- Previous version NOT to be used from April 6, 2026 onwards +- Suppliers already on NHS supplier registries must transition to new form + +**This is a PROCEDURAL update, not a new substantive requirement.** The compliance bar for clinical AI tools has not been raised or lowered — it's been streamlined. + +Source also: Periculo Cyber (cyber security compliance specialists), Acorn Compliance (healthtech compliance), NHS Transformation Directorate guidance portal. + +## Agent Notes + +**Why this matters (or why it matters less than I anticipated):** When researching the "April 6 deadline" from Session 11, I expected to find new substantive requirements. Instead, it's a form update — 25% fewer questions, better documentation. This is administrative streamlining, not a regulatory tightening. The "mandatory" framing in NHS communications made this sound like a new compliance gate; it's actually just a form swap. + +**What surprised me:** The de-duplication with DSPT and pre-acquisition questionnaire. This reduces friction for suppliers completing DTAC — it makes compliance EASIER, not harder. This partially undermines the "regulatory pressure forcing OE to disclose safety data" thesis from Session 11 — DTAC V2 is less burdensome, not more. + +**What I expected but didn't find:** New Annex-III-style requirements for clinical AI specifically. 
The DTAC V2 update is general digital health governance (applies to apps, devices, platforms) — there's no AI-specific clinical safety update analogous to EU AI Act's Annex III. That remains a gap in UK regulation. + +**KB connections:** +- This corrects an overstatement from Session 11: "NHS DTAC V2 is a mandatory clinical safety standard" is accurate but the "April 6, 2026 deadline" was framed as more consequential than it is +- The substantive compliance requirement is DCB0160 (clinical safety risk assessment) — unchanged +- The real regulatory pressure comes from the supplier registry (January 2026) and NHS procurement requirements — not DTAC V2 specifically +- Does NOT represent a new forcing function for OE safety disclosure; suppliers already using previous DTAC form just switch forms + +**Extraction hints:** +- Do NOT create a standalone claim for "DTAC V2 creates new compliance requirements" — it doesn't +- The relevant claim is already in the KB or in the supplier registry source: "NHS procurement of digital health tools requires DTAC assessment + clinical safety case (DCB0160)" +- This source is primarily a CORRECTION of Session 11's slightly elevated framing of the April 6 deadline + +**Context:** Multiple compliance advisory firms (Periculo, Acorn) confirm this interpretation — DTAC V2 is an administrative update, not a new compliance threshold. 
+ +## Curator Notes +PRIMARY CONNECTION: Session 11 regulatory track finding — corrects overstatement about April 6 deadline significance +WHY ARCHIVED: Prevents future sessions from treating the DTAC V2 April 6 deadline as a major regulatory event — it's a form update, not a new substantive requirement +EXTRACTION HINT: Do not extract as a standalone claim; use as context correction for Session 11 regulatory track framing diff --git a/inbox/queue/2026-03-10-abrams-bramajo-pnas-birth-cohort-mortality-us-life-expectancy.md b/inbox/queue/2026-03-10-abrams-bramajo-pnas-birth-cohort-mortality-us-life-expectancy.md new file mode 100644 index 00000000..4244911b --- /dev/null +++ b/inbox/queue/2026-03-10-abrams-bramajo-pnas-birth-cohort-mortality-us-life-expectancy.md @@ -0,0 +1,57 @@ +--- +type: source +title: "PNAS 2026: US Life Expectancy Stagnation Rooted in Post-1970 Birth Cohort Mortality Deterioration" +author: "Abrams & Bramajo et al. (UTMB researchers)" +url: https://www.pnas.org/doi/full/10.1073/pnas.2519356123 +date: 2026-03-10 +domain: health +secondary_domains: [] +format: research-paper +status: unprocessed +priority: high +tags: [life-expectancy, deaths-of-despair, birth-cohort, cardiovascular-disease, cancer, external-causes, mortality-trends, healthspan, belief-1] +--- + +## Content + +Published in *Proceedings of the National Academy of Sciences*, March 9-10, 2026, by UTMB researchers. Using Lexis diagrams, the study analyzed mortality changes from 1979–2023 for all-cause mortality and three cause groups (cardiovascular disease, cancer, external causes) across cohorts born between the 1890s and 1980s. + +**Key findings:** +- The **1950s birth cohort** is the inflection point: general improvements in earlier cohorts gave way to deterioration in later cohorts. +- Cohorts born **since 1970** exhibit **increasing mortality in cardiovascular disease, cancer, AND external causes** compared to their predecessors — across all three cause groups simultaneously. 
+- A **broad period-based mortality deterioration beginning around 2010** affected nearly every living adult cohort at the time, driven primarily by cardiovascular disease mortality. +- These patterns portend **"an unprecedented longer-run stagnation, or even sustained decline, in US life expectancy."** +- Stagnating life expectancy is "not the result of a single cause but a complex convergence of rising chronic disease, shifting behavioral risks, and increases in certain cancers among younger adults." + +Context: CDC separately released 2024 life expectancy data showing US LE reached 79.0 years (up 0.6 from 78.4 in 2023) — a modest COVID/overdose mortality recovery. But the PNAS cohort analysis shows this surface improvement masks structural deterioration embedded in younger cohorts. + +Companion piece: PNAS paper "Cohort mortality forecasts indicate signs of deceleration in life expectancy gains" (doi: 10.1073/pnas.2519179122) from same period, using cohort mortality forecasts to confirm deceleration. + +Coverage: News-Medical.net (March 10), UTMB newsroom (March 9), Subodh Verma MD on X summarizing the key cohort finding. + +## Agent Notes + +**Why this matters:** This is the strongest structural confirmation of Belief 1 (healthspan as civilization's binding constraint) in the past year. It's not just deaths of despair (drug overdoses — which temporarily surged and are now recovering) — it's a cohort-level deterioration across cardiovascular disease, cancer, AND external causes in Americans born after 1970. This is multi-causal, structural, and worsening. + +**What surprised me:** The 2010 period-effect deteriorating EVERY adult cohort simultaneously. This isn't just a younger generation problem — something happened around 2010 that made ALL adult cohorts sicker. That's not a behavioral cohort story; it's a systemic environment story. This is highly relevant to the "compounding failure" framing of Belief 1. 
+
+**What I expected but didn't find:** Evidence of a genuine reversal or plateau in deaths-of-despair as a sign that the healthspan problem is self-correcting. The CDC's +0.6 year LE improvement in 2024 might have suggested recovery. The PNAS cohort analysis shows this is surface-level optimism — the structural problem is in the cohort trajectory.
+
+**KB connections:**
+- Directly strengthens Belief 1 ("Healthspan Is Civilization's Binding Constraint") — the compounding failure is confirmed across multiple cause categories
+- Extends the deaths-of-despair framing: not just drug overdoses, but CVD and cancer also deteriorating in post-1970 cohorts
+- Connects to Belief 2 (80-90% non-clinical determinants) — if this is "rising chronic disease, shifting behavioral risks, and increases in certain cancers among younger adults," that's entirely within the non-clinical determinant zone
+- The "2010 period effect" is a potential new claim candidate: something environmental/social changed system-wide around 2010
+
+**Extraction hints:**
+- Primary claim: "US life expectancy stagnation is driven by a cohort-level mortality deterioration in Americans born after 1970 spanning CVD, cancer, and external causes — not a single-cause problem"
+- Secondary claim: "A period-based mortality deterioration beginning around 2010 affected nearly every adult US cohort simultaneously, suggesting systemic environmental/behavioral causes beyond cohort effects"
+- Belief 1 update candidate: temporal language should shift from "binding constraint" to "worsening binding constraint with compounding cohort dynamics"
+- Counter-note: CDC 2024 shows +0.6 LE recovery — should be noted as COVID/overdose surface recovery, not structural improvement
+
+**Context:** UTMB = University of Texas Medical Branch. Lead researchers Abrams and Bramajo. Independently confirmed by the PNAS companion paper. This is a peer-reviewed, large-n historical analysis — the highest-quality evidence available for longitudinal claims.
+ +## Curator Notes +PRIMARY CONNECTION: Belief 1 "healthspan is civilization's binding constraint" — structural confirmation +WHY ARCHIVED: Direct disconfirmation target for Belief 1 in Session 12; result is that Belief 1 is CONFIRMED and STRENGTHENED, not disconfirmed +EXTRACTION HINT: Extract as TWO claims: (1) post-1970 cohort mortality deterioration across CVD+cancer+external causes; (2) 2010 period-effect deteriorating all adult cohorts simultaneously — these have different causal implications diff --git a/inbox/queue/2026-03-10-cdc-us-life-expectancy-2024-79-years.md b/inbox/queue/2026-03-10-cdc-us-life-expectancy-2024-79-years.md new file mode 100644 index 00000000..b2f3c62e --- /dev/null +++ b/inbox/queue/2026-03-10-cdc-us-life-expectancy-2024-79-years.md @@ -0,0 +1,59 @@ +--- +type: source +title: "CDC NCHS 2025: US Life Expectancy Rose to 79.0 Years in 2024 — Recovery From COVID/Overdose Trough, Not Structural Improvement" +author: "CDC National Center for Health Statistics" +url: https://www.cdc.gov/nchs/products/databriefs/db548.htm +date: 2025-11-01 +domain: health +secondary_domains: [] +format: government-data +status: unprocessed +priority: medium +tags: [life-expectancy, deaths-of-despair, mortality-trends, belief-1, healthspan, cdc, public-health] +--- + +## Content + +CDC NCHS Data Brief 548: "Mortality in the United States, 2024." 
+
+**Key statistics:**
+- Life expectancy at birth, 2024: **79.0 years** (up 0.6 years from 78.4 in 2023)
+- This represents the third consecutive year of improvement after the COVID trough (2020-2021 lows)
+
+**Context from PNAS 2026 cohort analysis (Abrams & Bramajo):**
+The surface improvement to 79.0 years masks a structural cohort problem:
+- Post-1970 cohorts are dying earlier than predecessors from CVD, cancer, AND external causes
+- The 2010 period-effect deterioration affected every adult cohort
+- PNAS projects "unprecedented longer-run stagnation or even sustained decline" despite current surface recovery
+
+**Interpretation:** The 2024 recovery is primarily from lower COVID mortality and some stabilization in drug overdose deaths. It does NOT reflect structural improvement in the non-clinical determinants that drive the cohort trajectory.
+
+**Rising deaths of despair (2025 reporting):**
+- North America continues to show rising deaths of despair among young adults
+- Drug-related mortality "drives almost all of the post-2012 growth" in the life expectancy disadvantage for White, Black, and Hispanic Americans (PMC analysis)
+- Le Monde (2025): while global LE is climbing again, the US and Canada have flat/falling numbers due to preventable deaths among younger people
+
+## Agent Notes
+
+**Why this matters:** The CDC surface recovery (+0.6 years in 2024) is exactly the kind of data point that could be used to challenge Belief 1 — "look, US life expectancy is improving." The PNAS cohort analysis (Abrams & Bramajo, March 2026) is the needed context: the surface recovery is real, but the cohort dynamics are structural and worsening. These two data sources must be read together.
+
+**What surprised me:** The 2024 recovery is faster than expected (three consecutive years of improvement).
This creates a real rhetorical challenge to the "compounding failure" framing — someone citing 79.0 years and a three-year improvement trend could make a plausible case that the US health system is self-correcting. + +**What I expected but didn't find:** Any CDC analysis of the cohort vs. period effect distinction. The NCHS data brief reports aggregate life expectancy without decomposing into cohort vs. period effects — that analysis required the PNAS researchers. The KB needs BOTH sources together to give an accurate picture. + +**KB connections:** +- Must be paired with PNAS 2026 cohort study — surface improvement vs. structural deterioration +- Directly relevant to Belief 1 disconfirmation attempt: the 2024 improvement is real but not structural +- The OBBBA's projected 16,000 preventable deaths/year (from Session 8, Annals of Internal Medicine) would show up as a reversal of this trend in 2027-2028 data — important future observation point + +**Extraction hints:** +- Do NOT create a standalone claim for "life expectancy improved to 79.0 in 2024" without the structural context +- The claim should be: "The 2024 US life expectancy recovery to 79.0 years reflects lower COVID/overdose mortality rather than structural improvement in health determinants — post-1970 cohort mortality trajectories continue to deteriorate across CVD, cancer, and external causes (PNAS 2026)" +- This is a nuanced claim: surface improvement + structural deterioration are both true simultaneously + +**Context:** CDC NCHS is the authoritative source for US mortality statistics. Data brief is the primary publication format for national vital statistics. 
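The cohort-vs-period distinction that the NCHS brief doesn't make can be sketched mechanically. Below is a deliberately toy model (NOT the PNAS Lexis-diagram methodology; the shock sizes, the 2010/1970 cutoffs, and the age range are all invented for illustration): a period shock occupies a calendar-year column of the Lexis grid and hits every cohort alive at that date, while a cohort shock travels along a birth-year diagonal.

```python
# Toy Lexis-grid sketch. NOT the PNAS methodology; every number here
# (shock sizes, 2010/1970 cutoffs, age range) is invented for illustration.
ages = range(30, 70)

def excess_mortality(age, year):
    """Stylized excess mortality with one period shock and one cohort shock."""
    birth_year = year - age
    period_shock = 0.10 if year >= 2010 else 0.0        # calendar-year column: hits ALL cohorts from 2010 on
    cohort_shock = 0.05 if birth_year >= 1970 else 0.0  # birth-year diagonal: follows post-1970 cohorts
    return period_shock + cohort_shock

# A period effect shows up at every age in the affected years...
assert all(excess_mortality(a, 2015) >= 0.10 for a in ages)

# ...while a cohort effect is visible even BEFORE 2010, but only on
# the post-1970 diagonals:
assert excess_mortality(35, 2008) == 0.05  # born 1973: cohort shock only
assert excess_mortality(55, 2008) == 0.0   # born 1953: neither shock

# Aggregate (column-sum) excess, the only thing a period life table sees:
print(round(sum(excess_mortality(a, 2005) for a in ages), 2))  # small: cohort shock diluted
print(round(sum(excess_mortality(a, 2015) for a in ages), 2))  # large: period + cohort
```

The practical point: an aggregate period life table sums one column at a time, so a diagonal (cohort) deterioration is diluted and only becomes visible in the diagonal view, which is why the CDC brief and the PNAS analysis must be read together.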
+ +## Curator Notes +PRIMARY CONNECTION: Belief 1 disconfirmation context — why the surface recovery doesn't weaken the compounding failure thesis +WHY ARCHIVED: Necessary counter-context for any KB claim about recent US life expectancy improvement; prevents misleading extraction of positive trend without structural caveat +EXTRACTION HINT: Archive as paired with PNAS 2026 cohort study; the claim requires both sources to be accurate diff --git a/inbox/queue/2026-03-10-uk-lords-inquiry-nhs-ai-personalised-medicine.md b/inbox/queue/2026-03-10-uk-lords-inquiry-nhs-ai-personalised-medicine.md new file mode 100644 index 00000000..ef0075d4 --- /dev/null +++ b/inbox/queue/2026-03-10-uk-lords-inquiry-nhs-ai-personalised-medicine.md @@ -0,0 +1,58 @@ +--- +type: source +title: "UK House of Lords Science & Technology Committee: NHS AI and Personalised Medicine Inquiry Launched March 2026" +author: "UK Parliament / House of Lords Science and Technology Committee" +url: https://committees.parliament.uk/work/9659/ +date: 2026-03-10 +domain: health +secondary_domains: [] +format: policy-document +status: unprocessed +priority: medium +tags: [nhs, clinical-ai-safety, uk-policy, regulatory-pressure, personalised-medicine, innovation-adoption, belief-3, belief-5] +--- + +## Content + +The House of Lords Science and Technology Committee launched a new inquiry: **"Innovation in the NHS: Personalised Medicine and AI"** in March 2026. + +**Core question:** Why does the NHS struggle to adopt the UK's cutting-edge life sciences innovations — and what could be done to fix it? 
+
+**Focus areas:**
+- The gap between early-stage research, clinical trials, and NHS-wide delivery
+- Blockages in the system: procurement processes, clinical pathways, regulators, professional bodies
+- Personalised medicine as a case study for AI adoption more broadly
+
+**Timeline:**
+- First evidence session: March 10, 2026 (Professor Sir Mark Caulfield, 100,000 Genomes Project)
+- Written evidence deadline: April 20, 2026
+- Inquiry ongoing through 2026
+
+**Coverage:** UK Parliament website, HTN Health Tech News, Precision Medicine Online, Pathology News.
+
+## Agent Notes
+
+**Why this matters:** The UK Parliament is now investigating the SAME structural problem that Sessions 3-11 have been documenting: the gap between innovation (clinical AI capability) and adoption (NHS deployment). The Lords inquiry is asking the identical question from a policy/governance perspective. This is a new mechanism that could force regulatory or procurement reform: unlike the DTAC V2 form update, this is a parliamentary scrutiny process whose reports carry a formal government-response obligation, even though its recommendations are not themselves binding.
+
+**What surprised me:** The inquiry launched the same week as the PNAS birth cohort mortality study (March 9-10, 2026) and the DTAC V2 form publication — a week where multiple structural UK health/AI regulatory signals emerged simultaneously. This isn't coincidental; it reflects a broader 2026 UK reckoning with NHS AI adoption.
+
+**What I expected but didn't find:** Specific mention of clinical AI safety governance as a focus area. The inquiry appears focused on ADOPTION (why isn't AI getting into the NHS?) rather than SAFETY (is the AI that's being adopted safe?). This is the mirror image of the research concern — the research community worries about unsafe AI being adopted too fast; the Lords are worried about safe AI being adopted too slowly.
+
+**KB connections:**
+- Directly relevant to the "commercial-research-regulatory trifurcation" meta-finding from Session 11 — a fourth UK-specific track is now emerging (parliamentary scrutiny)
+- The procurement blockage focus connects to the VBC adoption stall (Belief 3): the same institutional friction that prevents VBC adoption also slows clinical AI adoption
+- The "personalised medicine and AI" framing is directly relevant to Belief 4 (atoms-to-bits): the inquiry covers genomics + AI — the intersection of biological data and digital delivery
+- If the inquiry produces recommendations on NHS AI procurement governance, this could affect DTAC requirements, NICE ESF thresholds, or MHRA device classification for clinical AI tools
+
+**Extraction hints:**
+- Not yet extractable as a claim — the inquiry is ongoing, no findings yet
+- Archive as a FUTURE WATCH: inquiry findings expected late 2026/early 2027
+- The important extract will be when the inquiry REPORTS — specifically if it recommends AI safety disclosure requirements that go beyond current DTAC/MHRA frameworks
+- Flag for future session: check for interim evidence submissions and witness testimony that may contain useful clinical AI safety evidence
+
+**Context:** The House of Lords Science and Technology Committee is a standing parliamentary committee with power to conduct inquiries, take evidence, and produce reports with government-response obligations. Professor Sir Mark Caulfield is among the most prominent UK genomics experts (he led the 100,000 Genomes Project). The inquiry framing around procurement blockages suggests frustration with NHS procurement conservatism — a potential tailwind for clinical AI adoption even as safety concerns mount.
+ +## Curator Notes +PRIMARY CONNECTION: Regulatory track from Session 11 + Belief 3 structural misalignment +WHY ARCHIVED: New UK policy mechanism that could affect NHS AI governance in 2026-2027; inquiry framing (adoption blockage) is different from EU AI Act (safety requirements) +EXTRACTION HINT: Watch for inquiry report (expected late 2026 or early 2027); the recommendations may create new NHS AI governance standards that bridge the commercial-research gap from the supply/procurement side diff --git a/inbox/queue/2026-03-20-iatrox-openevidence-uk-dtac-nice-esf-governance-review.md b/inbox/queue/2026-03-20-iatrox-openevidence-uk-dtac-nice-esf-governance-review.md new file mode 100644 index 00000000..8019692f --- /dev/null +++ b/inbox/queue/2026-03-20-iatrox-openevidence-uk-dtac-nice-esf-governance-review.md @@ -0,0 +1,72 @@ +--- +type: source +title: "iatroX Clinical AI Insights 2026: OpenEvidence Has No DTAC Assessment or MHRA Registration for UK Deployment — US-Centric Corpus Adds Clinical Risk" +author: "iatroX Clinical AI Insights" +url: https://www.iatrox.com/blog/openevidence-chatgpt-5-medwise-ai-iatrox-uk-clinicians-dtac-nice-esf +date: 2026-03-20 +domain: health +secondary_domains: [] +format: blog-analysis +status: unprocessed +priority: medium +tags: [openevidence, nhs-dtac, nice-esf, uk-healthcare, clinical-ai-safety, belief-5, regulatory-compliance, corpus-bias] +--- + +## Content + +iatroX Clinical AI Insights is a UK-focused clinical AI review publication that evaluates tools through the lens of NHS governance requirements (DTAC, NICE Evidence Standards Framework, MHRA). Multiple 2025-2026 reviews compare OpenEvidence against UK-compliant alternatives. + +**Key findings from multiple iatroX reviews:** + +**1. 
OE UK governance status:**
+- "OpenEvidence's UK-specific governance (DTAC/DCB) is not explicitly positioned on its public pages"
+- OE qualifies as a US-focused tool being used informally by UK clinicians — not formally NHS-deployed
+- OE has no published DTAC assessment, no MHRA Class 1 registration listed, and no NICE ESF submission
+
+**2. US-centric corpus clinical risk:**
+- OE is "built on a US-centric corpus"
+- May cite AHA (American Heart Association) guidelines instead of NICE guidelines
+- May suggest FDA-approved drugs that are: (a) not licensed in the UK, or (b) not cost-effective for NHS prescribing (not on formulary)
+- May reference dosing standards or treatment pathways that differ from the BNF (British National Formulary)
+- This is a CLINICAL SAFETY RISK for UK physicians, distinct from the demographic bias or automation bias documented in prior sessions
+
+**3. OE 2026 UK expansion signals:**
+- OE has "signalled plans for global expansion as a key 2026 and beyond initiative"
+- UK, Canada, Australia identified as "English-first markets with lower regulatory barriers"
+- But the "lower regulatory barriers" perception may be inaccurate for the UK: the NHS requires DTAC + MHRA Class 1 for formal deployment
+
+**4. OE "Visits" documentation tool (August 2025):**
+- OE Visits auto-generates clinical notes and enriches them with evidence-based guidelines
+- Described as "hybrid documentation+CDSS" — directly competes with the 19 registered NHS AVT (ambient voice technology) suppliers
+- Not on NHS England's supplier registry (launched January 2026)
+- Would require DTAC + MHRA Class 1 for formal NHS procurement
+
+**5.
UK landscape context:**
+- UK-native compliant alternatives exist: iatroX, Medwise AI, Praktiki, Pathway — all DTAC-compliant with a UK guideline corpus
+- NHS England's April 2025 ambient scribing guidance requires a clinical safety case (DCB0160), a DPIA, and mandatory human verification
+
+## Agent Notes
+
+**Why this matters:** iatroX provides the clearest independent assessment of what OE's governance gap means for UK clinical practice. The corpus risk is a different category from the demographic bias / automation bias concerns documented in prior sessions — it's not about LLM failure modes but about CONTENT misalignment with clinical practice guidelines. A UK physician querying OE about hypertension management may receive AHA recommendations (whose treatment thresholds differ from NICE's) or be directed to drugs not on the NHS formulary. This is a direct, immediate pathway to clinical harm, not a probabilistic risk.
+
+**What surprised me:** OE characterizing the UK as a market with "lower regulatory barriers" relative to the US. The UK NHS actually has MORE formal digital health procurement governance than the US (there is no federal-level US equivalent of DTAC). OE's US-market framing may be a strategic misjudgment about UK regulatory requirements.
+
+**What I expected but didn't find:** Any indication that OE has begun a DTAC assessment process in preparation for its stated 2026 UK expansion. Given the January 2026 supplier registry launch and the April 6 DTAC V2 deadline, OE has had two-plus months to begin compliance — and no announcement.
+ +**KB connections:** +- New failure mode for OE in UK context: US corpus → guideline mismatch → wrong recommendations for UK practice (distinct from demographic bias, automation bias, misinformation propagation) +- Directly extends the OE safety opacity thread from Sessions 8-11 into the UK market context +- The 19-vendor registry provides UK competitive context: OE Visits is behind UK-native tools in governance compliance +- Connects to the EU AI Act forcing function: if OE targets UK/EU expansion, regulatory compliance is not optional + +**Extraction hints:** +- New claim: "OpenEvidence's US-centric corpus creates a clinical safety risk for UK physicians that is distinct from LLM failure modes: AHA vs. NICE guideline misalignment and off-formulary drug suggestions in a market where OE has no DTAC assessment or MHRA registration" +- This claim is PROVEN (the governance gap is documented; the corpus misalignment is documented; no counter-evidence from OE) +- This is a UK-specific extension of the Session 11 "OE model opacity" finding — different mechanism, same transparency gap + +**Context:** iatroX is an independent UK clinical AI review publication. Not affiliated with any AI company. Reviews are conducted from a clinical governance perspective. Multiple consistent reviews across 2025-2026 confirm the governance gap. + +## Curator Notes +PRIMARY CONNECTION: OE model opacity thread (Sessions 8-11) — extended to UK clinical corpus mismatch +WHY ARCHIVED: Provides a previously undocumented clinical risk category for OE in non-US markets: guideline mismatch, not just LLM failure modes +EXTRACTION HINT: Extract as "OE UK deployment risk" claim, keeping scope to UK clinical practice (NICE vs. AHA corpus misalignment); link to DTAC absence finding