---
status: developed
type: musing
stage: complete
created: 2026-03-24
last_updated: 2026-03-24
tags: [clinical-ai-safety, nhs-dtac, eu-ai-act, regulatory-compliance, openevidence, belief-5-disconfirmation, belief-1-disconfirmation, deaths-of-despair, healthspan, pnas-cohort-mortality, real-world-deployment-gap, centaur-model, pharmacist-copilot, lords-inquiry, obbba, glp1-digital]
---
|
|
# Research Session 12: Keystone Belief Confirmed and Strengthened; Regulatory Track Clarified; Fifth Clinical AI Failure Mode
|
|
## Research Question
|
|
**Are clinical AI companies actually preparing for NHS DTAC V2 (April 6, 2026) and EU AI Act (August 2026) — and does emerging regulatory compliance behavior represent the first observable closing of the commercial-research gap? Secondary: what does new evidence say about deaths of despair and US life expectancy (Belief 1 disconfirmation attempt)?**
|
|
## Why This Question
|
|
Two concurrent targets:
|
|
**Thread A (primary — regulatory track from Session 11):** The NHS DTAC V2 April 6 deadline was framed in Session 11 as a major compliance moment. Session 12 tested whether this was substantive. Secondary: does the NHS supplier registry (19 vendors, January 2026) represent the actual compliance mechanism?
|
|
**Thread B (Belief 1 disconfirmation):** Belief 1 hasn't been targeted since Session 7 (March 19). The CDC's +0.6 year LE improvement in 2024 represents the strongest surface-level evidence against the "compounding failure" thesis. Can it be used to challenge the keystone belief?
|
|
**Disconfirmation targets:**

- Belief 5: Does emerging regulatory compliance or the pharmacist+LLM co-pilot evidence undermine the pessimistic clinical AI safety reading?
- Belief 1: Does the 2024 US LE recovery to 79.0 years, or any new deaths of despair data, suggest self-correction in the healthspan binding constraint?
|
|
---
|
|
## What I Found
|
|
### Finding 1: DTAC V2 April 6 Deadline Is Administrative — Less Consequential Than Session 11 Framed
|
|
**Correction:** NHS DTAC V2 (published February 24, 2026) is a **form update** (25% fewer questions, de-duplication with DSPT and pre-acquisition questionnaire). The April 6 deadline is the date when the old form must be retired, not a new substantive compliance gate. The clinical safety requirements (DCB0160, DCB0129) are unchanged.
|
|
**What IS the consequential mechanism:** The NHS England AI Scribing Supplier Registry (launched January 16, 2026) with 19 vendors meeting DTAC + MHRA Class 1 requirements. This registry is operational and open for new applications. THAT is the forcing function, not the DTAC V2 form deadline.
|
|
**Key observation:** OpenEvidence is absent from the 19-vendor registry despite OE "Visits" (documentation tool, August 2025) being a direct category competitor. OE's public website contains no DTAC assessment and no MHRA Class 1 registration. OE has signaled 2026 expansion targeting the UK, Canada, and Australia as "English-first markets with lower regulatory barriers" — but this characterization appears to be a strategic misjudgment: the NHS requires DTAC plus MHRA Class 1 registration for formal procurement of documentation tools.
|
|
**Practical implication:** OE Visits **cannot be formally deployed in NHS settings** without completing DTAC and MHRA Class 1. Informal use by individual clinicians continues (OE is already being reviewed and discussed in UK clinical contexts), but NHS organizational procurement requires compliance that OE hasn't demonstrated.
|
|
### Finding 2: New Clinical Risk for OE in UK Markets — Corpus Mismatch (Previously Undocumented)
|
|
iatroX Clinical AI Insights (UK-focused clinical AI review) documents a failure mode for OE in UK clinical practice that is **distinct from** the four failure modes documented in Sessions 8-11:
|
|
- OE uses a **US-centric corpus**: cites AHA guidelines rather than NICE guidelines
- May suggest drugs **licensed in the US but not available in the UK** (different BNF formulary)
- Dosing standards and treatment pathways may differ from UK clinical practice
- UK clinicians using OE may receive recommendations that are guideline-adherent for the US but not for the UK
|
|
This is not an LLM failure mode — it's a **data architecture mismatch**. The LLM may be accurate according to US evidence, but wrong for UK clinical practice. Relevant quote: "OE's UK-specific governance (DTAC/DCB) is not explicitly positioned on its public pages."
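One way to see why this is an architecture problem rather than a model problem: the retrieval layer, not the LLM, determines which guideline corpus can be cited. A minimal sketch of a region-aware retriever, with entirely hypothetical names and data (not OE's implementation):

```python
# Hypothetical illustration: the corpus wired in at retrieval time decides
# which guidelines the model can cite, regardless of model quality.
GUIDELINE_CORPORA = {
    "US": {"hypertension": "AHA 2025 guideline"},
    "UK": {"hypertension": "NICE NG136"},
}

def retrieve_guideline(topic: str, clinician_region: str, corpus_region: str = "US"):
    """With a US-only corpus, UK queries are answered with US guidance
    (corpus mismatch); a region-aware retriever would key the corpus off
    the clinician's region instead."""
    guidance = GUIDELINE_CORPORA[corpus_region].get(topic)
    mismatch = clinician_region != corpus_region
    return guidance, mismatch

# A UK clinician querying a US-corpus system gets US guidance:
print(retrieve_guideline("hypertension", clinician_region="UK"))
# → ('AHA 2025 guideline', True)
```

The design point: fixing the mismatch is a retrieval/corpus decision, orthogonal to model quality.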
|
|
**This is a SIXTH distinct clinical AI risk for OE specifically, not just a fifth general LLM failure mode.** The corpus mismatch is potentially more immediately harmful than probabilistic LLM failure modes because it affects ALL recommendations in specific clinical areas (drug prescribing, guideline-concordant treatment).
|
|
### Finding 3: Fifth General LLM Clinical Failure Mode — The Real-World Deployment Gap
|
|
Oxford Internet Institute + Nuffield Dept. of Primary Care, published in *Nature Medicine*, February 2026 (1,298 participants, randomized, preregistered):
|
|
- **LLMs alone:** 94.9% correct condition identification; 56.3% correct disposition
- **Participants using LLMs:** <34.5% correct condition; <44.2% correct disposition — **NO BETTER THAN CONTROL GROUP**
- A 60-percentage-point collapse between LLM isolated performance and user-assisted performance
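The size of that collapse is simple arithmetic on the reported numbers (illustrative only, not the paper's analysis code):

```python
# Headline numbers from the Oxford/Nature Medicine RCT, as summarized above.
llm_alone = 94.9       # % correct condition identification, LLM in isolation
user_with_llm = 34.5   # % correct condition, participants assisted by the LLM

# The "real-world deployment gap": benchmark-style performance minus
# user-assisted performance, in percentage points.
gap_pp = llm_alone - user_with_llm
print(f"Deployment gap: {gap_pp:.1f} percentage points")  # → 60.4
```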
|
|
Root cause: **"two-way communication breakdown"** — users didn't know what the LLM needed; responses mixed good and poor recommendations making it hard to extract correct action.
|
|
**Study conclusion:** "Just as clinical trials are required for medications, AI systems need rigorous testing with diverse, real users."
|
|
**Scope note:** This was PUBLIC use (general population), not physician use like OE. The mechanism may be weaker for trained physicians. But the finding is structural: benchmark performance is NOT a predictor of real-world user-assisted outcomes. The JMIR systematic review of 761 LLM evaluation studies confirms: only 5% used real patient care data; 95% used USMLE-style exam questions. The benchmark-to-reality gap is systematic.
|
|
**Five general LLM clinical failure modes now documented:**

1. Omission-reinforcement (NOHARM: 76.6% of severe errors are omissions)
2. Demographic bias amplification (Nature Medicine, JMIR e78132: systematic bias across care settings)
3. Automation bias robustness (NCT06963957: survives 20-hour training)
4. Medical misinformation propagation (Lancet DH: 32%/47% in clinical language)
5. **Real-world deployment gap (Oxford/Nature Medicine RCT: 60pp performance collapse in user interaction)**
|
|
**Six OE-specific risks (five above + corpus mismatch in non-US markets).**
|
|
### Finding 4: Counter-Evidence — Centaur Model Works Under Specific Conditions
|
|
*Cell Reports Medicine*, October 2025 (PMC12629785), 91 error scenarios across 16 clinical specialties:
|
|
- Pharmacist + LLM co-pilot: **61% accuracy**; **1.5x improvement for serious harm errors vs. pharmacist alone**
- Architecture: RAG (retrieval-augmented generation) from a curated drug database — NOT parametric memory
|
|
**This is the best positive clinical AI safety evidence found across 12 sessions.** The centaur design CAN work, but under specific conditions:
1. Domain expert is ENGAGED and in co-pilot mode (not automation bias mode)
2. LLM uses RAG from a curated database (reduces hallucination, corpus mismatch, misinformation propagation)
3. Task is STRUCTURED (medication safety review — not open-ended clinical reasoning)
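Condition 2 can be sketched concretely. The following toy example illustrates the structured-RAG pattern with a hypothetical interaction database and function names (an illustration of the pattern, not the study's system):

```python
# Toy illustration of the structured-RAG co-pilot pattern: the model's job is
# narrowed to checking an order against retrieved, curated facts, not
# open-ended reasoning from parametric memory. All data here is hypothetical.
CURATED_DB = {  # stand-in for a curated drug-interaction database
    ("ibuprofen", "warfarin"): "increased bleeding risk",
}

def retrieve_interaction(drug_a: str, drug_b: str):
    """Retrieval step: facts come from the curated source, not model memory."""
    return CURATED_DB.get(tuple(sorted((drug_a.lower(), drug_b.lower()))))

def copilot_flags(order: list[str]) -> list[str]:
    """Structured task: flag pairwise interactions for the pharmacist, who
    stays engaged and makes the final call (co-pilot, not autopilot)."""
    flags = []
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            fact = retrieve_interaction(a, b)
            if fact:
                flags.append(f"{a} + {b}: {fact}")
    return flags

print(copilot_flags(["warfarin", "ibuprofen", "metformin"]))
# → ['warfarin + ibuprofen: increased bleeding risk']
```

The pharmacist reviews discrete flags rather than consuming free-text reasoning, which is part of what keeps the expert in co-pilot mode rather than automation-bias mode.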
|
|
**The conditions matter.** OE doesn't use this architecture: it's a general clinical reasoning tool, not a structured RAG safety checker. But the pharmacist+LLM co-pilot result provides the mechanistic proof that the centaur design can work — it requires design intentionality, not just human oversight.
|
|
### Finding 5: Belief 1 CONFIRMED AND STRENGTHENED — Post-1970 Cohort Mortality Deterioration
|
|
**PNAS 2026** (Abrams & Bramajo et al., UTMB, published March 9-10, 2026):
- Post-1970 cohorts: **increasing mortality in CVD, cancer, AND external causes** vs. predecessors — across ALL three cause groups simultaneously
- **A broad mortality deterioration beginning around 2010** affected **nearly every living adult cohort** — not just younger generations
- Projected: "**unprecedented longer-run stagnation, or even sustained decline**, in US life expectancy"
- Not a single-cause problem: "complex convergence of rising chronic disease, shifting behavioral risks, and increases in certain cancers among younger adults"
|
|
**Context:** CDC reports 2024 US life expectancy reached **79.0 years** (up 0.6 from 78.4 in 2023) — three consecutive years of post-COVID recovery. BUT the PNAS cohort analysis shows this surface improvement is a COVID/overdose recovery, not structural improvement. The cohort trajectory is worsening.
|
|
**The "2010 period effect" is the most significant new finding for Belief 1:** Something systemic changed around 2010 that made EVERY adult cohort simultaneously sicker. This is not a generational behavioral story — it's an environmental/systemic story. The 1950s birth cohort is the transition point from improvement to deterioration.
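The period/cohort distinction this finding turns on can be made concrete with toy numbers (entirely synthetic, not the PNAS data): in an age-by-calendar-year mortality table, a period effect raises a whole calendar-year column at once, while a cohort effect follows a birth-year group diagonally as it ages.

```python
# Synthetic age x calendar-year death-rate table illustrating the difference
# between a PERIOD effect (hits every cohort alive in the same calendar years)
# and a COHORT effect (travels with a birth-year group as it ages). Toy numbers.
rates = {}
for age in (30, 40, 50):
    for year in (2000, 2010, 2020):
        r = 1.0
        if year >= 2010:            # period effect: "something changed in 2010"
            r += 0.3
        if year - age >= 1970:      # cohort effect: post-1970 birth cohorts
            r += 0.2
        rates[(age, year)] = round(r, 2)

# Period effect: same age, straddling 2010, both pre-1970 cohorts -> column shift
print(rates[(50, 2000)], rates[(50, 2010)])  # → 1.0 1.3
# Cohort effect: the 1970 birth cohort carries its penalty at every age it reaches
print(rates[(40, 2010)], rates[(50, 2020)])  # → 1.5 1.5
```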
|
|
**Belief 1 disconfirmation result: FAILED.** The strongest candidate for disconfirmation (CDC's +0.6 year improvement) is surface noise over a deepening structural problem. The PNAS analysis provides the most comprehensive multi-cause confirmation of the compounding failure thesis to date.
|
|
### Finding 6: Regulatory Track — Four Mechanisms, Not Three
|
|
Session 11 identified THREE tracks (commercial, research, regulatory). Session 12 identifies **four**:
|
|
**Track 3A — EU AI Act (August 2026, European deployments):** Unchanged from Session 11. OE has made no compliance announcements for European markets.
|
|
**Track 3B — NHS Procurement (UK, operational now):** The supplier registry is the mechanism — 19 vendors compliant, OE absent. UK expansion requires DTAC + MHRA Class 1. This is OE's choice point.
|
|
**Track 4 — UK Parliamentary Scrutiny (March 2026, ongoing):** The House of Lords Science and Technology Committee launched its "Innovation in the NHS: Personalised Medicine and AI" inquiry on March 10, 2026. Written evidence deadline: April 20, 2026. Focus: why does the NHS struggle to adopt innovation, and what's blocking it? This is adoption-focused (the opposite framing from the EU AI Act's safety focus). If the inquiry recommends procurement reform that streamlines AI adoption, it could accelerate OE's NHS path, though OE would still need to complete the governance requirements that streamlining doesn't eliminate.
|
|
### Finding 7: OBBBA Work Requirements — Implementation On Track
|
|
As of January 2026:
- 7 states with pending Section 1115 waivers (Arizona, Arkansas, Iowa, Montana, Ohio, South Carolina, Utah)
- Nebraska implementing via state plan amendment (without waiver) — ahead of the federal mandate
- Federal mandate deadline: December 31, 2026 (with extension to 2028 available)
- Coverage loss effects begin: Q1 2027
|
|
This confirms Session 8's structural concern: VBC enrollment stability will be disrupted beginning Q1 2027. The BALANCE model's effectiveness under enrollment fragmentation is the key question for 2027.
|
|
---
|
|
## Synthesis
|
|
**The clinical AI safety picture after 12 sessions:**
|
|
The failure mode catalogue is now comprehensive:
- Five general LLM failure modes (vs. three when this thread started in Session 8)
- One OE-specific failure mode in non-US markets (corpus mismatch)
- One counter-evidence case for centaur design (pharmacist+RAG+structured task)
- One fundamental evaluation methodology problem (95% of studies use exam questions, not real patient data)
|
|
The regulatory track has four mechanisms, not three. The NHS supplier registry (operational) and Lords inquiry (adoption-focused) are the UK-specific mechanisms. The EU AI Act remains the largest-scale forcing function (August 2026). None of these mechanisms are yet producing OE safety disclosure.
|
|
**The centaur design insight from Session 12:** The pharmacist+LLM co-pilot result shows the design that would work: RAG architecture, domain expert as engaged co-pilot, structured safety task. OE's design (general clinical reasoning, physician as consumer not co-pilot) is architecturally different from the pharmacist+LLM model. The centaur isn't broken; OE isn't the centaur.
|
|
**Belief 1 after Session 12:** The keystone belief is more structurally grounded than it was before this session. The PNAS 2026 multi-cause cohort analysis is the strongest evidence Vida has encountered for the compounding failure thesis. The 2010 period effect (all cohorts deteriorating simultaneously) opens a new research direction: what systemic factor changed in 2010?
|
|
---
|
|
## Claim Candidates
|
|
CLAIM CANDIDATE 1: "US life expectancy stagnation is rooted in a post-1970 birth cohort mortality deterioration spanning cardiovascular disease, cancer, and external causes simultaneously — and a period-effect beginning around 2010 that deteriorated every living adult cohort — portending unprecedented longer-run stagnation or sustained decline (PNAS 2026)"
- Domain: health
- Confidence: proven (PNAS peer-reviewed, large n, 1979-2023 data, confirmed by companion PNAS forecast paper)
- Sources: PNAS doi: 10.1073/pnas.2519356123 (March 2026), UTMB newsroom
- KB connections: Strongest structural confirmation of Belief 1 compounding failure thesis; extends deaths-of-despair framing to include CVD and cancer cohort deterioration
|
|
CLAIM CANDIDATE 2: "LLMs achieve 94.9% clinical condition identification accuracy in isolation but participants using the same LLMs perform no better than control groups (<34.5%) — establishing a real-world deployment gap between LLM knowledge and user-assisted outcome improvement that is not predicted by benchmark performance (Nature Medicine RCT, 1,298 participants, Oxford 2026)"
- Domain: health, secondary: ai-alignment
- Confidence: proven (RCT, preregistered, 1,298 participants, three LLMs all showing same gap)
- Sources: Nature Medicine Vol 32 p. 609-615 (February 2026, Oxford)
- KB connections: Fifth distinct clinical AI failure mode; methodologically distinct from automation bias (different mechanism: user fails to extract correct guidance, not physician deferring to wrong guidance); paired with JMIR 95% benchmark evaluation finding
|
|
CLAIM CANDIDATE 3: "Pharmacist + LLM co-pilot using retrieval-augmented generation improves serious medication harm detection by 1.5x vs. pharmacist alone across 16 clinical specialties — evidence that the centaur model works under conditions of domain expert engagement, RAG architecture, and structured safety tasks (Cell Reports Medicine, October 2025)"
- Domain: health, secondary: ai-alignment
- Confidence: likely (prospective cross-over, 91 scenarios, 16 specialties, peer-reviewed Cell Press journal; RAG architecture constraint is key scope qualifier)
- Sources: Cell Reports Medicine doi: 10.1016/j.xcrm.2025.00396-9; PMC12629785
- KB connections: Counter-evidence to the pessimistic reading of Belief 5; establishes design conditions under which centaur succeeds vs. fails; contrasts with automation bias finding (NCT06963957) where centaur fails
|
|
CLAIM CANDIDATE 4: "OpenEvidence's US-centric clinical corpus creates a distinct category of harm in UK clinical practice — guideline mismatch with NICE recommendations, BNF formulary discrepancies, and off-license drug suggestions — independent of LLM failure modes and unaddressed by OE's absence of DTAC assessment or MHRA registration as of March 2026"
- Domain: health
- Confidence: proven (guideline corpus mismatch is documented; governance absence is documented fact; iatroX review is independent UK clinical assessment)
- Sources: iatrox.com review series 2025-2026; NHS DTAC guidance; MHRA medical device registration requirements
- KB connections: Sixth OE-specific clinical risk; extends the OE safety opacity thread from Sessions 8-11 into non-US markets; connects to NHS supplier registry absence
|
|
CLAIM CANDIDATE 5: "95% of clinical LLM evaluation studies assessed performance on medical examination questions rather than real patient care data — establishing a systematic evaluation methodology gap that makes USMLE-level benchmark performance uninterpretable as a clinical safety signal (JMIR systematic review, 761 studies, 39 benchmarks)"
- Domain: health, secondary: ai-alignment
- Confidence: proven (systematic review of 761 studies, peer-reviewed JMIR, PMC12706444)
- Sources: JMIR e84120 (2025); PMC12706444
- KB connections: Foundational methodology claim for the benchmark-to-reality gap; explains why OE's "100% USMLE" benchmark performance cited in Session 9 is not interpretable as a clinical safety signal; pairs with Oxford/Nature Medicine RCT as the empirical demonstration
|
|
---
|
|
## Disconfirmation Results
|
|
**Belief 1 (keystone — healthspan as binding constraint): NOT DISCONFIRMED. STRUCTURALLY STRENGTHENED.**

The strongest disconfirmation candidate (CDC 2024 LE recovery to 79.0 years) is surface noise over the structural deterioration documented in the PNAS cohort analysis. The compounding failure thesis is now supported by multi-cause, multi-cohort evidence spanning CVD, cancer, and external causes — not just deaths of despair.
|
|
**Belief 5 (clinical AI safety): NOT DISCONFIRMED. Failure mode catalogue extended to five (general) + one (OE-specific).**

Counter-evidence found (pharmacist+LLM co-pilot, Cell Reports Medicine): centaur design works under RAG+structured+expert-engaged conditions. This is meaningful — the design EXISTS that would work. OE's architecture differs from this design.
|
|
---
|
|
## Follow-up Directions
|
|
### Active Threads (continue next session)
|
|
- **PNAS "2010 period effect" — what systemic change explains the 2010 deterioration across all cohorts?** This is the most important unexplored question in the Belief 1 thread. ACA passage was 2010; opioid crisis peaked 2015-2016; social media became mass-market 2009-2012. Multiple candidate mechanisms. A targeted search for research on "what changed in 2010 in US mortality" could yield a new structural claim.
|
|
- **EU AI Act August 2026 — OE European compliance status:** Unchanged from Session 11. The five-month clock is now down to ~4.5 months. Watch for: any OE press release mentioning EU compliance, any European health system partnership that would trigger Annex III obligations.
|
|
- **Lords inquiry evidence submissions:** Written evidence deadline is April 20, 2026 — 27 days away. The submissions from NHS trusts, clinical AI companies, and researchers will be published on the Parliament website. This is potentially the richest multi-voice clinical AI governance document of 2026. Watch for OE's submission (if filed) or NHS trust perspectives on clinical AI safety barriers.
|
|
- **NCT07328815 (ensemble LLM confidence signals behavioral nudge trial):** Still no results. Continue watching.
|
|
- **OE UK expansion actual timeline:** The 2026 signal is there but no concrete UK product announcement. Watch for: (a) DTAC assessment filing by OE, (b) MHRA Class 1 registration by OE, (c) OE Visits being offered to NHS trusts.
|
|
### Dead Ends (don't re-run)
|
|
- **Tweet feeds:** Confirmed dead. Don't check.
- **OE-specific demographic bias evaluation:** Confirmed dead in Session 11. Don't re-run.
- **Big Tech GLP-1 adherence native platform:** Confirmed dead across Sessions 9-12. Don't re-run.
- **DTAC V2 April 6 as major compliance gate:** Confirmed this session that it's a form update, not a new substantive requirement. Don't re-frame this as a forcing function.
- **Canada semaglutide generics data:** Health Canada rejection (Dr. Reddy's) confirmed in Session 10. 2027 at earliest.
|
|
### Branching Points
|
|
- **2010 mortality deterioration — behavioral vs. structural cause:**
  - Direction A: The 2010 period effect is primarily driven by the opioid crisis and deaths of despair (behavioral) — which are beginning to stabilize as overdose deaths plateau. Implications: the period effect may be transient, and the Belief 1 compounding failure framing is stronger for the cohort effect (permanent) than the period effect (potentially reversing).
  - Direction B: The 2010 period effect is systemic (ACA insurance disruption, Great Recession sequelae, metabolic disease epidemic acceleration, social isolation amplified by smartphones/social media) — structural rather than behavioral. Implications: the period effect continues and compounds with the cohort effect, accelerating projected decline.
  - **Recommendation: Direction B seems more consistent with the multi-cause finding (CVD AND cancer AND external causes all deteriorating — not just overdose). A behavioral drug crisis would show up primarily in external causes; CVD and cancer deteriorating together suggests metabolic/systemic drivers.**
|
|
- **Lords inquiry impact — adoption vs. safety framing race in UK:**
  - Direction A: The Lords inquiry focuses on adoption blockage and produces recommendations that streamline NHS AI procurement. Clinical AI adoption accelerates but safety requirements remain minimal (DTAC is the floor). Safety concerns documented in research continue to diverge from commercial deployment.
  - Direction B: Evidence submissions to the Lords inquiry surface the clinical AI safety literature (NOHARM, Oxford RCT, Nature Medicine bias studies) and the inquiry expands its mandate to include safety governance recommendations. This would be the most consequential UK regulatory event for clinical AI safety since the NHS began digitizing.
  - **Recommendation: Direction A is more likely given the inquiry's explicit framing ("why aren't we adopting faster?"). Direction B requires a compelling evidence submission that re-frames adoption failure as a safety feature, not a bug. Watch evidence submissions carefully.**
|