extract: 2026-03-22-nature-medicine-llm-sociodemographic-bias #1626

Closed
leo wants to merge 1 commit from extract/2026-03-22-nature-medicine-llm-sociodemographic-bias into main
5 changed files with 65 additions and 1 deletion

@@ -41,6 +41,12 @@ OpenEvidence reached 30M+ monthly consultations by March 2026, including a histo
ARISE report reframes OpenEvidence adoption as shadow-IT workaround behavior rather than validation of clinical value. Clinicians use OE to 'bypass slow internal IT systems' because institutional tools are too slow for clinical workflows. This suggests rapid adoption reflects institutional system failure, not OE's clinical superiority.
### Additional Evidence (challenge)
*Source: [[2026-03-22-nature-medicine-llm-sociodemographic-bias]] | Added: 2026-03-22*
Nature Medicine 2025 found systematic sociodemographic bias in LLM clinical recommendations across all model types, with LGBTQIA+ cases receiving mental health referrals 6-7x more often than clinically indicated and high-income cases receiving significantly more CT/MRI recommendations than low/middle-income cases (P < 0.001). If OpenEvidence "reinforces physician plans" at 30M+ monthly consultations, and those plans already contain demographic biases, OE may be amplifying rather than reducing healthcare inequities at unprecedented scale.
Relevant Notes:

@@ -33,6 +33,12 @@ OpenEvidence's 1M daily consultations (30M+/month) with 44% of physicians expres
---
### Additional Evidence (extend)
*Source: [[2026-03-22-nature-medicine-llm-sociodemographic-bias]] | Added: 2026-03-22*
Nature Medicine 2025 adds a third failure mode: even when physicians use AI recommendations correctly, those recommendations may encode systematic demographic biases (6-7x excess mental health referrals for LGBTQIA+ patients, income-stratified imaging access) that physicians cannot detect, because the bias is embedded in the model's training data rather than visible in any individual output.
Relevant Notes:
- [[centaur team performance depends on role complementarity not mere human-AI combination]] -- the chess centaur model does NOT generalize to clinical medicine where physician overrides degrade AI performance
- [[medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials]] -- the multi-hospital RCT found similar diagnostic accuracy with/without AI; the Stanford/Harvard study found AI alone dramatically superior

@@ -35,6 +35,12 @@ OpenEvidence's medRxiv preprint (November 2025) showed 24% accuracy for relevant
ARISE report identifies specific failure modes: real-world performance 'breaks down when systems must manage uncertainty, incomplete information, or multi-step workflows.' This provides mechanistic detail for why benchmark performance doesn't translate — benchmarks test pattern recognition on complete data while clinical care requires uncertainty management.
### Additional Evidence (extend)
*Source: [[2026-03-22-nature-medicine-llm-sociodemographic-bias]] | Added: 2026-03-22*
Nature Medicine 2025 study (1.7M outputs, 9 LLMs) shows that even when LLMs produce clinically accurate recommendations, they systematically vary those recommendations based on demographic framing (race, income, LGBTQIA+ status) in ways not supported by clinical guidelines. This extends the benchmark-to-impact gap: models can be diagnostically accurate on average while still producing demographically biased recommendations that worsen health equity.
Relevant Notes:

@@ -0,0 +1,32 @@
{
  "rejected_claims": [
    {
      "filename": "llm-clinical-recommendations-show-systematic-sociodemographic-bias-across-all-model-types.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "llm-demographic-framing-effects-reveal-training-data-encodes-healthcare-rationing-patterns.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 2,
    "rejected": 2,
    "fixes_applied": [
      "llm-clinical-recommendations-show-systematic-sociodemographic-bias-across-all-model-types.md:set_created:2026-03-22",
      "llm-demographic-framing-effects-reveal-training-data-encodes-healthcare-rationing-patterns.md:set_created:2026-03-22"
    ],
    "rejections": [
      "llm-clinical-recommendations-show-systematic-sociodemographic-bias-across-all-model-types.md:missing_attribution_extractor",
      "llm-demographic-framing-effects-reveal-training-data-encodes-healthcare-rationing-patterns.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-22"
}
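
For reference, a minimal sketch of how a report with this schema could be loaded and summarized. This is not the repository's actual validation tooling; the field names come from the JSON above, while the file path, function name, and output format are placeholder assumptions.

```python
import json


def summarize_rejection_report(path: str) -> None:
    """Print a short summary of a claim-validation report with the schema shown above."""
    with open(path, encoding="utf-8") as fh:
        report = json.load(fh)

    stats = report["validation_stats"]
    print(f"{report['date']} | model: {report['model']}")
    print(
        f"total={stats['total']} kept={stats['kept']} "
        f"fixed={stats['fixed']} rejected={stats['rejected']}"
    )
    for claim in report["rejected_claims"]:
        print(f"  rejected {claim['filename']}: {', '.join(claim['issues'])}")


if __name__ == "__main__":
    # Placeholder path, not the repository's actual filename.
    summarize_rejection_report("rejection_report.json")
```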

@@ -7,9 +7,13 @@ date: 2025-01-01
domain: health
secondary_domains: [ai-alignment]
format: research paper
status: unprocessed
status: enrichment
priority: high
tags: [llm-bias, sociodemographic-bias, clinical-ai-safety, race-bias, income-bias, lgbtq-bias, health-equity, medical-ai, nature-medicine]
processed_by: vida
processed_date: 2026-03-22
enrichments_applied: ["medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md", "OpenEvidence became the fastest-adopted clinical technology in history reaching 40 percent of US physicians daily within two years.md", "human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -54,3 +58,13 @@ Coverage: Nature Medicine, PubMed, Inside Precision Medicine (ChatBIAS study cov
PRIMARY CONNECTION: "clinical AI augments physicians but creates novel safety risks requiring centaur design" (Belief 5 supporting claim)
WHY ARCHIVED: First large-scale empirical proof that LLM clinical AI has systematic sociodemographic bias, found across all model types — this makes the "OE reinforces plans" safety concern concrete and quantifiable
EXTRACTION HINT: Extract the demographic bias finding as its own claim, separate from the general "clinical AI safety" framing. The 6-7x LGBTQIA+ mental health referral rate and income-driven imaging disparity are specific enough to disagree with and verify.
## Key Facts
- Study analyzed 1.7 million LLM-generated outputs from 9 different models
- 1,000 emergency department cases (500 real, 500 synthetic) each presented in 32 sociodemographic variations
- LGBTQIA+ subgroups received mental health assessment recommendations approximately 6-7 times more often than clinically indicated
- High-income cases received significantly more CT/MRI recommendations than low/middle-income cases (P < 0.001); see the framing-comparison sketch after this list
- Published in Nature Medicine 2025, PubMed ID 40195448
- Bias found in both proprietary and open-source models
- Study covered by Inside Precision Medicine, UCSF Coordinating Center for Diagnostic Excellence, Conexiant
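
The framing-effect findings above lend themselves to a simple statistical illustration. The sketch below is not the Nature Medicine study's analysis code; it only shows the general shape of such a test: pool binary recommendations (for example "order CT/MRI") by sociodemographic framing and run a chi-square test of independence. The framing labels and counts are toy values invented for illustration.

```python
from collections import Counter

from scipy.stats import chi2_contingency


def framing_effect_p_value(results: list[tuple[str, bool]]) -> float:
    """results holds (framing_label, recommended) pairs pooled across cases.

    Returns the chi-square p-value for independence of framing and recommendation.
    """
    yes, no = Counter(), Counter()
    for framing, recommended in results:
        (yes if recommended else no)[framing] += 1
    framings = sorted(set(yes) | set(no))
    # 2 x k contingency table: rows = recommended / not recommended, columns = framings
    table = [[yes[f] for f in framings], [no[f] for f in framings]]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


# Hypothetical toy data (not from the study): the same cases, framed as high income,
# get imaging recommended more often than when framed as low income.
toy = (
    [("high_income", True)] * 80 + [("high_income", False)] * 20
    + [("low_income", True)] * 55 + [("low_income", False)] * 45
)
print(f"p = {framing_effect_p_value(toy):.4g}")
```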