---
status: seed
type: musing
stage: developing
created: 2026-03-23
last_updated: 2026-03-23
tags: [clinical-ai-safety, openevidence, sociodemographic-bias, multi-agent-ai, automation-bias, behavioral-nudges, eu-ai-act, nhs-dtac, llm-misinformation, regulatory-pressure, belief-5-disconfirmation, market-research-divergence]
---
# Research Session 11: OE-Specific Bias Evaluation, Multi-Agent Market Entry, and the Commercial-Research Divergence
## Research Question
**Has OpenEvidence been specifically evaluated for the sociodemographic biases documented across all nine LLMs tested in Nature Medicine 2025 — and are multi-agent clinical AI architectures (the NOHARM-proposed harm-reduction approach) entering the clinical market as a safety design?**
## Why This Question
**Session 10 (March 22) opened two Directions from Belief 5's expanded failure mode catalogue:**
- **Direction A (priority):** Search for OE-specific bias evaluation. The Nature Medicine study found systematic demographic bias in all 9 tested LLMs, but OE was not among them. An OE-specific evaluation would either (a) confirm the bias exists in OE or (b) provide the first counter-evidence to the reinforcement-as-bias-amplification mechanism.
- **Secondary active thread:** Are multi-agent clinical AI systems entering the market with the safety framing NOHARM recommends? (Multi-agent reduces harm by 8%.) If yes, the centaur model problem has a market-driven solution. If no, the gap between NOHARM evidence and market practice is itself a concerning observation.
**Disconfirmation target — Belief 5 (clinical AI safety):**
The strongest complication from Session 10: NOHARM shows best-in-class LLMs outperform generalist physicians on safety by 9.7%. If OE uses best-in-class models AND has undergone bias evaluation, the "reinforcement-as-bias-amplification" mechanism might be overstated.
**What would disconfirm the expanded Belief 5 concern:**
- OE-specific bias evaluation showing no demographic bias
- OE disclosure of NOHARM-benchmark model performance
- Multi-agent safety designs entering commercial market (which would make OE's single-agent architecture an addressable problem)
- Regulatory pressure forcing OE safety disclosure (shifts concern from "permanent gap" to "addressable regulatory problem")
## What I Found
### Core Finding 1: OE Has No Published Sociodemographic Bias Evaluation — Absence Is the Finding
Direction A from Session 10: Search for any OE-specific evaluation of sociodemographic bias in clinical recommendations.
**Result: No OE-specific bias evaluation exists.** Zero published or disclosed evaluation. OE's own documentation describes itself as providing "reliable, unbiased and validated medical information" — but this is marketing language, not evidence. The Wikipedia article and PMC review articles do not cite any bias evaluation methodology.
This absence is itself a finding of high KB value: OE operates at $12B valuation, 30M+ monthly consultations, with a recent EHR integration into Sutter Health (~12,000 physicians), and has published zero demographic bias assessment. The Nature Medicine finding (systematic demographic bias in ALL 9 tested LLMs, both proprietary and open-source) applies by inference — OE has not rebutted it with its own evaluation.
**New PMC article (PMC12951846, Philip & Kurian, 2026):** A 2026 review article describes OE as "reliable, unbiased and validated" — but provides no evidence for the "unbiased" claim. This is a citation risk: future work citing this review will inherit an unsupported "unbiased" characterization.
**Wiley + OE partnership (new, March 2026):** Wiley partnered with OE to deliver Wiley medical journal content at point of care. This expands OE's content licensing but does not address the model architecture transparency problem. More content sources do not change the fact that the underlying model's demographic bias has never been evaluated.
### Core Finding 2: OE's Model Architecture Remains Undisclosed — NOHARM Benchmark Unknown
**Search result:** No disclosure of OE's model architecture, training data, or NOHARM safety benchmark performance. OE's press releases describe their approach as "evidence-based" and sourced from NEJM, JAMA, Lancet, and now Wiley — but do not name the underlying language model, describe training methodology, or cite any clinical safety benchmark.
**Why this matters under the NOHARM framework:** The NOHARM study found that the BEST-performing models (Gemini 2.5 Flash, LiSA 1.0) produce severe errors in 11.8-14.6% of cases, while the WORST models (o4 mini, GPT-4o mini) produce severe errors in 39.9-40.1% of cases. Without knowing where OE's model falls in this spectrum, the 30M+/month consultation figure is uninterpretable from a safety standpoint. OE could be at the top of the safety distribution (below generalist physician baseline) or significantly below it — and neither physicians nor health systems can know.
**The Sutter Health integration raises the stakes:** OE is now embedded in Epic EHR at Sutter Health with "high standards for quality, safety and patient-centered care" (from Sutter's press release) — but no pre-deployment NOHARM evaluation was cited. An EHR-embedded tool with unknown safety benchmarks now operates in-context for ~12,000 physicians.
### Core Finding 3: Multi-Agent AI Entering Healthcare — But for EFFICIENCY, Not SAFETY
Mount Sinai study (npj Health Systems, published online March 9, 2026): "Orchestrated Multi-Agent AI Systems Outperform Single Agents in Health Care"
- Lead: Girish N. Nadkarni (Director, Hasso Plattner Institute for Digital Health, Icahn School of Medicine)
- Finding: Distributing healthcare AI tasks among specialized agents reduces computational demands by **65x** while maintaining performance as task volume scales
- Use cases demonstrated: finding patient information, extracting data, checking medication doses
- **Framing: EFFICIENCY AND SCALABILITY, not safety**
**The critical distinction from NOHARM:** The NOHARM paper showed multi-agent REDUCES CLINICAL HARM (8% harm reduction vs. solo model). The Mount Sinai study shows multi-agent is COMPUTATIONALLY EFFICIENT. These are different claims, but both point to multi-agent architecture as superior to single-agent. The market is deploying multi-agent for cost/scale reasons; the safety case from NOHARM is not yet driving commercial adoption.
This creates a meaningful KB finding: the first large-scale multi-agent clinical AI deployment (Mount Sinai demonstration) is framed around efficiency metrics, not harm reduction. The 8% harm reduction that NOHARM documents is not being operationalized as the primary market argument for multi-agent adoption.
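To make the architectural contrast concrete, here is a minimal sketch of the orchestration pattern the Mount Sinai study describes: route each task to a small specialized agent rather than sending every call to one large generalist model. The agent names, task types, and relative-cost figures are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of an orchestrated multi-agent dispatch. Agent names,
# task types, and relative costs are illustrative assumptions only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    handles: set[str]          # task types this specialist covers
    relative_cost: float       # compute cost relative to a generalist call
    run: Callable[[str], str]  # the specialist model call (stubbed here)

def make_stub(name: str) -> Callable[[str], str]:
    return lambda task: f"[{name}] handled: {task}"

AGENTS = [
    Agent("retrieval", {"find_patient_info"}, 0.02, make_stub("retrieval")),
    Agent("extraction", {"extract_data"}, 0.05, make_stub("extraction")),
    Agent("dose_check", {"check_medication_dose"}, 0.01, make_stub("dose_check")),
]

def orchestrate(task_type: str, payload: str) -> str:
    """Route each task to the cheapest specialist that covers it, instead
    of sending every task to one large generalist model on every call."""
    candidates = [a for a in AGENTS if task_type in a.handles]
    if not candidates:
        raise ValueError(f"no specialist for task type: {task_type}")
    agent = min(candidates, key=lambda a: a.relative_cost)
    return agent.run(payload)

print(orchestrate("check_medication_dose", "metformin 2000 mg/day, eGFR 40"))
```

The efficiency claim lives in the `relative_cost` dimension: if most tasks never touch the generalist-scale model, per-task compute drops even as task volume scales.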
**Separately, NCT07328815** (the follow-on behavioral nudges trial to NCT06963957) uses a novel multi-agent approach for a different purpose: generating ensemble confidence signals to flag low-confidence AI recommendations to physicians. Three LLMs (Claude Sonnet 4.5, Gemini 2.5 Pro Thinking, GPT-5.1) each rate the confidence of AI recommendations; the mean determines a color-coded signal. This is NOT multi-agent for clinical reasoning — it's multi-agent for UI signaling to reduce physician automation bias. It's the first concrete operationalized solution to the automation bias problem.
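A minimal sketch of how that ensemble signal could work, per the trial registration's description (three raters, mean score, color-coded cue). The 0-100 scale and the threshold cut-points below are assumptions; the registration does not specify them.

```python
# Sketch of the NCT07328815 ensemble confidence signal: three LLMs each
# rate an AI recommendation's confidence; the mean drives a traffic-light
# cue shown to the physician. Scale and thresholds are assumptions.
from statistics import mean

RATERS = ["claude-sonnet-4.5", "gemini-2.5-pro-thinking", "gpt-5.1"]

def rate_confidence(model: str, recommendation: str) -> float:
    """Placeholder for the per-model confidence call (0-100 scale assumed)."""
    raise NotImplementedError("wire up the actual model API here")

def confidence_signal(scores: list[float]) -> str:
    """Map the mean of the three ratings to a color-coded cue; a low
    signal is meant to trigger System 2 deliberation, not block the AI."""
    m = mean(scores)
    if m >= 75:
        return "green"   # high ensemble confidence
    if m >= 50:
        return "yellow"  # mixed confidence: review recommended
    return "red"         # low confidence: deliberate before accepting

print(confidence_signal([82.0, 61.0, 44.0]))  # -> "yellow" (mean ~62.3)
```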
### Core Finding 4: Lancet Digital Health — LLMs Propagate Medical Misinformation 32% of the Time (47% in Clinical Note Format)
Mount Sinai (Eyal Klang et al.), published in The Lancet Digital Health, February 2026:
- 1M+ prompts across leading language models
- **Average propagation of medical misinformation: 32%**
- **When misinformation embedded in hospital discharge summary / clinical note format: 47%**
- Smaller/less advanced models: >60% propagation
- ChatGPT-4o: ~10% propagation
- Key mechanism: "AI systems treat confident medical language as true by default, even when it's clearly wrong"
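A sketch of the kind of probe this finding implies: embed a false claim in confident discharge-summary language and measure how often the model under test repeats it. The template, the stubbed model call, and the keyword scorer are illustrative assumptions, not the study's actual protocol (the study ran 1M+ prompts across leading models).

```python
# Sketch of a misinformation-propagation probe: plant a false claim in
# confident clinical-note language, then count how often the model's
# summary repeats it. Template, stub, and scorer are assumptions.

DISCHARGE_TEMPLATE = (
    "DISCHARGE SUMMARY\n"
    "Assessment: {false_claim}\n"
    "Plan: continue current regimen; follow up in 2 weeks.\n\n"
    "Question: Summarize the key clinical facts for the covering physician."
)

def query_model(prompt: str) -> str:
    """Placeholder for the model under test."""
    raise NotImplementedError("wire up the model API here")

def propagation_rate(false_claim: str, marker: str, n_trials: int = 100) -> float:
    """Fraction of responses that repeat the planted claim (detected here
    by a naive keyword marker; a real evaluation needs structured grading)."""
    prompt = DISCHARGE_TEMPLATE.format(false_claim=false_claim)
    hits = sum(marker.lower() in query_model(prompt).lower()
               for _ in range(n_trials))
    return hits / n_trials
```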
**This is a FOURTH clinical AI safety failure mode.** The catalogue now comprises:
1. Omission errors (NOHARM: 76.6% of severe errors are omissions)
2. Sociodemographic bias (Nature Medicine: demographic labels alter recommendations)
3. Automation bias (NCT06963957: physicians defer to erroneous AI even after AI-literacy training)
4. **Medical misinformation propagation (THIS FINDING: 32% average; 47% in clinical language)**
**Critical connection to OE specifically:** OE's use case is exactly the scenario where clinical language is most authoritative. Physicians query OE using clinical language; OE synthesizes medical literature. If OE encounters conflicting information (where one source contains an error presented in confident clinical language), the 47% propagation rate for clinical-note-format misinformation is directly applicable. This failure mode is particularly insidious because it's invisible to the physician: OE would confidently cite a "peer-reviewed source" containing the misinformation.
**Combined with the "reinforces plans" finding:** If a physician's query to OE contains a false assumption (stated confidently in clinical language), OE may accept the false premise and build a recommendation around it, then confirm the physician's existing (incorrect) plan. This is the omission-reinforcement mechanism combined with the misinformation propagation mechanism.
### Core Finding 5: JMIR Nursing Care Plan Bias — Extends Demographic Bias to Nursing Settings
JMIR e78132 (JMIR 2025, Volume 2025/1): "Detecting Sociodemographic Biases in the Content and Quality of Large Language Model-Generated Nursing Care: Cross-Sectional Simulation Study"
- 96 sociodemographic identity combinations tested (first such study for nursing)
- 9,600 GPT-generated nursing care plans analyzed
- **Finding: LLMs systematically reproduce sociodemographic biases in BOTH content AND expert-rated clinical quality of nursing care plans**
- Described as "first empirical evidence documenting these nuanced biases in nursing"
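To make the study's scale concrete, a sketch of the experimental grid: a full cross of sociodemographic attributes yields 96 identity combinations, and 9,600 total plans implies 100 generated plans per combination if sampling was uniform. The specific attribute axes below are illustrative assumptions; the paper defines its own variables.

```python
# Sketch of a 96-combination sociodemographic grid in the style of the
# JMIR study. The attribute axes are illustrative assumptions chosen to
# multiply to 96, not the study's actual variables.
from itertools import product

AXES = {
    "race_ethnicity": ["White", "Black", "Hispanic", "Asian"],   # 4
    "gender": ["woman", "man"],                                  # 2
    "age_group": ["young adult", "middle-aged", "older adult"],  # 3
    "insurance": ["insured", "uninsured"],                       # 2
    "housing": ["housed", "unhoused"],                           # 2
}

combinations = [dict(zip(AXES, values)) for values in product(*AXES.values())]
assert len(combinations) == 96  # 4 * 2 * 3 * 2 * 2

PLANS_PER_COMBINATION = 9600 // len(combinations)  # 100, if sampled uniformly
print(f"{len(combinations)} identities x {PLANS_PER_COMBINATION} plans each = "
      f"{len(combinations) * PLANS_PER_COMBINATION}")
```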
**KB value:** The Nature Medicine finding (demographic bias in physician clinical decisions) is now extended to a different care setting (nursing), a different AI platform (GPT vs. the 9 models in Nature Medicine), and a different care task (nursing care planning vs. emergency department triage). The bias is not specific to emergency medicine or physician decisions — it appears in planned, primary care nursing contexts too. This strengthens the inference that OE's model (whatever it is) likely shows similar demographic bias patterns.
### Core Finding 6: Regulatory Pressure Is Building — EU AI Act (August 2026) and NHS DTAC (April 2026)
**EU AI Act — August 2, 2026 compliance deadline:**
- Healthcare AI is classified as "high-risk" under Annex III
- Core obligations (effective August 2, 2026 for new deployments or significantly changed systems):
1. **Risk management system** — ongoing throughout lifecycle
2. **Human oversight** — mandatory, not optional; "meaningful" oversight requirement
3. **Dataset documentation** — training data must be "well-documented, representative, and sufficient in quality"
4. **EU database registration** — high-risk AI systems must be registered before deployment in Europe
5. **Transparency to users** — instructions for use, limitations disclosed
- Full Annex III obligations (including manufacturer requirements): August 2, 2027
**NHS England DTAC Version 2 — April 6, 2026 deadline:**
- Published February 24, 2026
- Requires ALL digital health tools deployed in NHS to meet updated clinical safety and data protection standards
- Deadline: April 6, 2026 (two weeks from today)
- This is a MANDATORY requirement, not a voluntary standard
**Why this matters for the OE safety concern:**
- OE has expanded internationally (Wiley partnership suggests European reach)
- If OE is used in NHS settings (UK has strong clinical AI adoption) or European healthcare systems, NHS DTAC and EU AI Act compliance is required
- EU AI Act's "dataset documentation" and "transparency to users" requirements would effectively force OE to disclose training data governance and safety limitations
- The "meaningful human oversight" requirement directly addresses the automation bias problem — you can't satisfy "mandatory meaningful human oversight" while deploying EHR-embedded AI with no pre-deployment safety evaluation
**This is the most important STRUCTURAL finding of this session:** For the first time, there is an external regulatory mechanism (EU AI Act) that could force OE to do what the research literature has been asking for: disclose model architecture, conduct bias evaluation, and implement meaningful safety governance. The regulatory track is converging on the research track's concerns — but the effective date (August 2026) gives OE 5 months to come into compliance.
## Synthesis: The 2026 Commercial-Research-Regulatory Trifurcation
The clinical AI field in 2026 is operating on three parallel tracks that are NOT converging:
**Track 1 — Commercial deployment (no safety infrastructure):**
- OE: $12B, 30M+/month consultations, Sutter Health EHR integration, Wiley content expansion
- No NOHARM benchmark disclosure, no demographic bias evaluation, no model architecture transparency
- Framing: adoption metrics, physician satisfaction, content breadth
**Track 2 — Research safety evidence (accumulating, not adopted):**
- NOHARM: 22% severe error rate; 76.6% are omissions → confirmed
- Nature Medicine: demographic bias in all 9 tested LLMs → OE by inference
- NCT06963957: automation bias survives 20-hour AI-literacy training → confirmed
- Lancet Digital Health: 47% misinformation propagation in clinical language → new
- JMIR e78132: demographic bias in nursing care planning → extends the scope
- NCT07328815: ensemble LLM confidence signals as behavioral nudge → solution in trial
- Mount Sinai multi-agent: efficiency-framed multi-agent deployment → not safety-framed
**Track 3 — Regulatory pressure (arriving 2026):**
- NHS DTAC V2: mandatory clinical safety standard, April 6, 2026 (NOW)
- EU AI Act Annex III: healthcare AI high-risk, August 2, 2026 (5 months)
- NIST AI Agent Standards: agent identity/authorization/security (no healthcare guidance yet)
- EU AI Act obligations will require: risk management, meaningful human oversight, dataset transparency, EU database registration
**The meta-finding:** Commercial and research tracks have been DIVERGING for 3+ sessions. The regulatory track is the exogenous force that could close the gap — but the August 2026 deadline applies to European deployments. US deployments (OE's primary market) face no equivalent mandatory disclosure requirement as of March 2026. The centaur design that Belief 5 proposes requires REGULATORY PRESSURE to be implemented because market forces are not driving it.
## Claim Candidates
CLAIM CANDIDATE 1: "LLMs propagate medical misinformation 32% of the time on average and 47% when misinformation is presented in confident clinical language (hospital discharge summary format) — a failure mode distinct from omission errors and demographic bias that makes the OE 'reinforces plans' mechanism more dangerous when the physician's query contains false premises"
- Domain: health, secondary: ai-alignment
- Confidence: likely (1M+ prompt analysis published in Lancet Digital Health; 32%/47% figures are empirical; connection to OE is inference)
- Sources: Lancet Digital Health PII S2589-7500(25)00131-1 (February 2026, Mount Sinai); Euronews coverage February 10, 2026
- KB connections: Fourth distinct clinical AI safety failure mode; combines with NOHARM omission finding and OE "reinforces plans" (PMC12033599) to define a three-layer failure scenario; extends Belief 5's failure mode catalogue
CLAIM CANDIDATE 2: "OpenEvidence has disclosed no NOHARM safety benchmark, no demographic bias evaluation, and no model architecture details despite operating at $12B valuation, 30M+ monthly clinical consultations, and EHR embedding in Sutter Health — making its safety profile unmeasurable against the NOHARM framework that defines current state-of-the-art clinical AI safety evaluation"
- Domain: health, secondary: ai-alignment
- Confidence: proven (the absence of disclosure is documented fact; NOHARM exists and is applicable; the scale metrics are confirmed)
- Sources: OE announcements, Sutter Health press release, NOHARM study (arxiv 2512.01241), Wikipedia OE, PMC12951846
- KB connections: Connects to the "scale without evidence" finding from Session 8; extends the OE safety concern to the specific absence of NOHARM-benchmark disclosure; establishes the comparison standard for clinical AI safety evaluation
CLAIM CANDIDATE 3: "Multi-agent clinical AI architecture entered commercial healthcare deployment in March 2026 (Mount Sinai, npj Health Systems) framed as 65x computational efficiency improvement — not as the 8% harm reduction that the NOHARM study documented, revealing a gap between research safety framing and commercial adoption framing of the same architectural approach"
- Domain: health, secondary: ai-alignment
- Confidence: likely (Mount Sinai study is peer-reviewed; NOHARM multi-agent finding is peer-reviewed; the framing gap is inference from comparing the two)
- Sources: npj Health Systems (March 9, 2026, Mount Sinai); arxiv 2512.01241 (NOHARM); EurekAlert newsroom coverage March 2026
- KB connections: Extends the multi-agent discussion from NOHARM; creates a new KB node on the commercial-safety gap in multi-agent deployment framing
CLAIM CANDIDATE 4: "The EU AI Act's Annex III high-risk classification and August 2, 2026 compliance deadline imposes the first external regulatory requirement for healthcare AI to document training data, implement mandatory human oversight, register in an EU database, and disclose limitations — creating regulatory pressure for clinical AI safety transparency that market forces have not produced"
- Domain: health, secondary: ai-alignment
- Confidence: proven (EU AI Act text is law; August 2, 2026 deadline is documented; healthcare AI classification as high-risk is established in Annex III and Article 6)
- Sources: EU AI Act official text; Orrick EU AI Act Guide; educolifesciences.com compliance guide; Lancet Digital Health PII S2589-7500(25)00131-1
- KB connections: New regulatory node for health KB; connects to the commercial-research-regulatory trifurcation meta-finding; creates the structural argument for why safety disclosure will eventually be forced in European markets
CLAIM CANDIDATE 5: "LLMs systematically produce sociodemographically biased nursing care plans — reproducing biases in both content and expert-rated clinical quality across 9,600 generated plans (96 identity combinations) — extending the Nature Medicine demographic bias finding from emergency department physician decisions to planned nursing care contexts"
- Domain: health, secondary: ai-alignment
- Confidence: proven (9,600 tests, peer-reviewed JMIR publication, 96 identity combinations)
- Sources: JMIR doi: 10.2196/78132 (2025, volume 2025/1)
- KB connections: Extends Nature Medicine (2025) demographic bias finding to a different care setting; strengthens the inference that OE's model has demographic bias (now two independent studies showing pervasive LLM demographic bias across care contexts)
CLAIM CANDIDATE 6: "The NCT07328815 behavioral nudges trial operationalizes the first concrete solution to physician-LLM automation bias through a dual mechanism: (1) anchoring cue showing ChatGPT's baseline accuracy before evaluation, (2) ensemble-LLM color-coded confidence signals (mean of Claude Sonnet 4.5, Gemini 2.5 Pro Thinking, GPT-5.1 ratings) to engage System 2 deliberation — making multi-agent architecture a UI-layer safety tool rather than a clinical reasoning architecture"
- Domain: health, secondary: ai-alignment
- Confidence: experimental (trial design is registered and methodologically sound; outcome is not yet published for NCT07328815; intervention design is novel and first of its kind)
- Sources: ClinicalTrials.gov NCT07328815; medRxiv 2025.08.23.25334280v1 (parent study NCT06963957)
- KB connections: First operationalized solution to automation bias documented in Sessions 9-10; the ensemble-LLM signal is a novel multi-agent safety design; connects to NOHARM multi-agent finding; extends Belief 5's "centaur design must address" framing with a concrete intervention design
## Disconfirmation Result: Belief 5 — NOT DISCONFIRMED; Fourth Failure Mode Added
**Target:** Does OE's model architecture or a specific bias evaluation provide counter-evidence to the reinforcement-as-bias-amplification mechanism? Does multi-agent architecture in the market address the centaur design failure?
**Search result:**
- No OE bias evaluation: **Direction A comes up empty** — the absence of disclosure is itself the finding. OE has produced no counter-evidence to the demographic bias inference.
- Multi-agent market deployment: **Efficiency-framed, not safety-framed.** The commercial market is NOT deploying multi-agent for the harm-reduction reasons NOHARM documents. The gap between research evidence and market practice is confirmed and named.
- **New failure mode (Lancet DH 2026):** Medical misinformation propagation (32% average; 47% in clinical language format) adds a fourth mechanism to the Belief 5 failure mode catalogue.
**Belief 5 assessment:**
The failure mode catalogue now has four distinct entries:
1. **Omission-reinforcement** (NOHARM): OE confirms plans with missing actions → omissions become fixed
2. **Demographic bias amplification** (Nature Medicine, JMIR e78132): OE's model likely carries systematic bias; reinforcing demographically biased plans at scale amplifies them
3. **Automation bias robustness** (NCT06963957): even AI-trained physicians defer to erroneous AI
4. **Medical misinformation propagation** (Lancet DH 2026): LLMs accept false claims in clinical language 47% of the time → physician queries containing false premises get confirmed
**Counter-evidence state:** The only counter-evidence to Belief 5 remains the NOHARM finding that best-in-class models outperform generalist physicians on safety by 9.7%. OE's model class is unknown, so this counter-evidence cannot be applied to OE specifically.
**Structural insight (new this session):** The regulatory track (EU AI Act August 2026, NHS DTAC April 2026) creates the first mechanism to close the gap. Market forces have not driven clinical AI safety disclosure — but regulatory requirements will force it in European markets within 5 months. For US markets, no equivalent mandatory disclosure mechanism exists as of March 2026.
## Belief Updates
**Belief 5 (clinical AI safety):** **CATALOGUE EXTENDED — fourth failure mode documented.**
The Lancet Digital Health misinformation propagation finding (32% average; 47% in clinical-note format) is a distinct mechanism from omissions (NOHARM), demographic bias (Nature Medicine), and automation bias (NCT06963957). The full failure mode set now requires all four entries for completeness.
**Belief 3 (structural misalignment):** **NEW REGULATORY DIMENSION.** The EU AI Act and NHS DTAC V2 show that regulatory pressure is beginning to fill the gap that market forces have left. This doesn't change the diagnosis (structural misalignment persists) but adds a new mechanism for correction: regulatory mandate rather than market incentive.
**Cross-session meta-pattern update:** The theory-practice gap has held for 11 sessions. This session adds a new dimension: a REGULATORY track is now arriving (separate from both commercial deployment and research evidence). The three tracks (commercial, research, regulatory) are not yet converging, but the regulatory track is the first external force that could bridge the gap between the research finding (OE needs safety evaluation) and the commercial practice (OE has none).
## Follow-up Directions
### Active Threads (continue next session)
- **EU AI Act August 2026 — OE European compliance status:** Five months to OE compliance in European markets. Watch for: (1) any OE announcement about EU AI Act compliance; (2) any European health system partnership announcement that would trigger Annex III obligations; (3) any OE disclosure of training data governance or risk management system. This is the single thread most likely to force the model transparency that the research literature has demanded.
- **NHS DTAC V2 April 6, 2026 deadline (NOW):** This deadline is 2 weeks away. If OE is used in NHS settings, compliance is required now. Watch for: any UK news of NHS hospitals using OE, any DTAC assessment of OE, any NHS digital health approval or rejection of OE tools.
- **NCT07328815 results:** The behavioral nudges trial (ensemble LLM confidence signals) is the most concrete solution to automation bias in the clinical AI space. Results are unknown. Watch for: any preprint or trial completion announcement.
- **Mount Sinai multi-agent efficiency → safety bridge:** The March 9 study frames multi-agent as efficiency. Will subsequent publications from the same group (Nadkarni et al.) or NOHARM authors bridge to safety framing? The conceptual bridge is short; the commercial motivation (65x cost reduction) is there. Watch for: follow-on publications framing multi-agent efficiency as also providing safety redundancy.
- **OE model transparency pressure:** The EU AI Act compliance clock and the accumulating research literature (four failure modes documented) create pressure for OE to disclose model architecture. Watch for: any OE press release, research partnership, or regulatory filing that mentions model specifics. The Wiley content partnership is commercial, not technical — it doesn't help.
### Dead Ends (don't re-run)
- **Tweet feeds:** Sessions 6-11 all confirm dead. Don't check.
- **Big Tech GLP-1 adherence search:** Session 9 confirmed no native platform. Session 11 found no new signals. Don't re-run until a product announcement emerges.
- **OE-specific bias evaluation search:** Direction A from Session 10 is now closed as a dead end — no study exists. The absence is documented. Don't re-run this search; instead, watch for EU AI Act forcing disclosure.
- **May 2026 Canada semaglutide data point:** Session 10 confirmed Health Canada rejected Dr. Reddy's application. Don't expect Canada data until mid-2027 at earliest.
### Branching Points
- **EU AI Act → OE transparency forcing function:**
- Direction A: EU AI Act August 2026 forces OE to disclose model architecture, training data, and safety evaluation for European deployments — and OE publishes its first formal safety documentation. This would be the highest-value KB event in the clinical AI safety thread: finally knowing where OE sits on the NOHARM spectrum.
- Direction B: OE Europe is a small enough share of revenue that compliance is handled through a lightweight process that doesn't produce meaningful safety disclosure. The August 2026 deadline arrives with minimal public transparency from OE.
- **Recommendation: Watch (can't act until August 2026). But track any European health system partnership announcements from OE — they would trigger the compliance obligation.**
- **Multi-agent: efficiency framing vs. safety framing race:**
- Direction A: Efficiency framing wins. Multi-agent is adopted for 65x cost reduction. Safety benefits are a secondary effect that materializes but is not measured.
- Direction B: Safety framing catches up. NOHARM authors or ARISE publish a comparative analysis showing efficiency AND harm reduction as dual benefits — and health system procurement begins requiring multi-agent architecture.
- **Recommendation: Direction A is more likely in the short term. Direction B requires a high-profile clinical AI safety incident to shift the framing. Watch for any reported adverse event associated with single-agent clinical AI — that's the trigger for the framing shift.**