From acb40271cadbdda57763cc8fcf06a9861fb9462d Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sun, 22 Mar 2026 04:18:09 +0000
Subject: [PATCH] extract: 2026-03-22-nature-medicine-llm-sociodemographic-bias

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
---
 ...of US physicians daily within two years.md |  6 ++++
 ... errors when overriding correct outputs.md |  6 ++++
 ...iagnostic accuracy in randomized trials.md |  6 ++++
 ...re-medicine-llm-sociodemographic-bias.json | 32 +++++++++++++++++++
 ...ture-medicine-llm-sociodemographic-bias.md | 16 +++++++++-
 5 files changed, 65 insertions(+), 1 deletion(-)
 create mode 100644 inbox/queue/.extraction-debug/2026-03-22-nature-medicine-llm-sociodemographic-bias.json

diff --git a/domains/health/OpenEvidence became the fastest-adopted clinical technology in history reaching 40 percent of US physicians daily within two years.md b/domains/health/OpenEvidence became the fastest-adopted clinical technology in history reaching 40 percent of US physicians daily within two years.md
index 9334c8dcc..2a2190b5c 100644
--- a/domains/health/OpenEvidence became the fastest-adopted clinical technology in history reaching 40 percent of US physicians daily within two years.md
+++ b/domains/health/OpenEvidence became the fastest-adopted clinical technology in history reaching 40 percent of US physicians daily within two years.md
@@ -41,6 +41,12 @@ OpenEvidence reached 30M+ monthly consultations by March 2026, including a histo
 ARISE report reframes OpenEvidence adoption as shadow-IT workaround behavior rather than validation of clinical value. Clinicians use OE to 'bypass slow internal IT systems' because institutional tools are too slow for clinical workflows. This suggests rapid adoption reflects institutional system failure, not OE's clinical superiority.
+### Additional Evidence (challenge)
+*Source: [[2026-03-22-nature-medicine-llm-sociodemographic-bias]] | Added: 2026-03-22*
+
+Nature Medicine 2025 found systematic sociodemographic bias in LLM clinical recommendations across all model types, with LGBTQIA+ cases receiving mental health referrals 6-7x more often than clinically indicated and income determining imaging access (P < 0.001). If OpenEvidence "reinforces physician plans" at 30M+ monthly consultations, and those plans already contain demographic biases, OE may be amplifying rather than reducing healthcare inequities at unprecedented scale.
+
+
 Relevant Notes:
diff --git a/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md b/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md
index 48a0da2a4..36bb035e5 100644
--- a/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md
+++ b/domains/health/human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md
@@ -33,6 +33,12 @@ OpenEvidence's 1M daily consultations (30M+/month) with 44% of physicians expres
 ---
+### Additional Evidence (extend)
+*Source: [[2026-03-22-nature-medicine-llm-sociodemographic-bias]] | Added: 2026-03-22*
+
+Nature Medicine 2025 adds a third failure mode: even when physicians correctly use AI recommendations, those recommendations may encode systematic demographic biases (6-7x LGBTQIA+ mental health referrals, income-stratified imaging access) that physicians cannot detect because the bias is embedded in the model's training data, not visible in individual outputs.
+
+
 Relevant Notes:
 - [[centaur team performance depends on role complementarity not mere human-AI combination]] -- the chess centaur model does NOT generalize to clinical medicine where physician overrides degrade AI performance
 - [[medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials]] -- the multi-hospital RCT found similar diagnostic accuracy with/without AI; the Stanford/Harvard study found AI alone dramatically superior
diff --git a/domains/health/medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md b/domains/health/medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md
index 6c4e105c9..eb3d7d4a4 100644
--- a/domains/health/medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md
+++ b/domains/health/medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md
@@ -35,6 +35,12 @@ OpenEvidence's medRxiv preprint (November 2025) showed 24% accuracy for relevant
 ARISE report identifies specific failure modes: real-world performance 'breaks down when systems must manage uncertainty, incomplete information, or multi-step workflows.' This provides mechanistic detail for why benchmark performance doesn't translate — benchmarks test pattern recognition on complete data while clinical care requires uncertainty management.
+### Additional Evidence (extend)
+*Source: [[2026-03-22-nature-medicine-llm-sociodemographic-bias]] | Added: 2026-03-22*
+
+Nature Medicine 2025 study (1.7M outputs, 9 LLMs) shows that even when LLMs produce clinically accurate recommendations, they systematically vary those recommendations based on demographic framing (race, income, LGBTQIA+ status) in ways not supported by clinical guidelines. This extends the benchmark-to-impact gap: models can be diagnostically accurate on average while still producing demographically biased recommendations that worsen health equity.
+
+
 Relevant Notes:
diff --git a/inbox/queue/.extraction-debug/2026-03-22-nature-medicine-llm-sociodemographic-bias.json b/inbox/queue/.extraction-debug/2026-03-22-nature-medicine-llm-sociodemographic-bias.json
new file mode 100644
index 000000000..cfe8007f1
--- /dev/null
+++ b/inbox/queue/.extraction-debug/2026-03-22-nature-medicine-llm-sociodemographic-bias.json
@@ -0,0 +1,32 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "llm-clinical-recommendations-show-systematic-sociodemographic-bias-across-all-model-types.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "llm-demographic-framing-effects-reveal-training-data-encodes-healthcare-rationing-patterns.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 2,
+    "rejected": 2,
+    "fixes_applied": [
+      "llm-clinical-recommendations-show-systematic-sociodemographic-bias-across-all-model-types.md:set_created:2026-03-22",
+      "llm-demographic-framing-effects-reveal-training-data-encodes-healthcare-rationing-patterns.md:set_created:2026-03-22"
+    ],
+    "rejections": [
+      "llm-clinical-recommendations-show-systematic-sociodemographic-bias-across-all-model-types.md:missing_attribution_extractor",
+      "llm-demographic-framing-effects-reveal-training-data-encodes-healthcare-rationing-patterns.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-22"
+}
\ No newline at end of file
diff --git a/inbox/queue/2026-03-22-nature-medicine-llm-sociodemographic-bias.md b/inbox/queue/2026-03-22-nature-medicine-llm-sociodemographic-bias.md
index b212e9efb..7b5908946 100644
--- a/inbox/queue/2026-03-22-nature-medicine-llm-sociodemographic-bias.md
+++ b/inbox/queue/2026-03-22-nature-medicine-llm-sociodemographic-bias.md
@@ -7,9 +7,13 @@ date: 2025-01-01
 domain: health
 secondary_domains: [ai-alignment]
 format: research paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [llm-bias, sociodemographic-bias, clinical-ai-safety, race-bias, income-bias, lgbtq-bias, health-equity, medical-ai, nature-medicine]
+processed_by: vida
+processed_date: 2026-03-22
+enrichments_applied: ["medical LLM benchmark performance does not translate to clinical impact because physicians with and without AI access achieve similar diagnostic accuracy in randomized trials.md", "OpenEvidence became the fastest-adopted clinical technology in history reaching 40 percent of US physicians daily within two years.md", "human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -54,3 +58,13 @@ Coverage: Nature Medicine, PubMed, Inside Precision Medicine (ChatBIAS study cov
 PRIMARY CONNECTION: "clinical AI augments physicians but creates novel safety risks requiring centaur design" (Belief 5 supporting claim)
 WHY ARCHIVED: First large-scale empirical proof that LLM clinical AI has systematic sociodemographic bias, found across all model types — this makes the "OE reinforces plans" safety concern concrete and quantifiable
 EXTRACTION HINT: Extract the demographic bias finding as its own claim, separate from the general "clinical AI safety" framing. The 6-7x LGBTQIA+ mental health referral rate and income-driven imaging disparity are specific enough to disagree with and verify.
+
+
+## Key Facts
+- Study analyzed 1.7 million LLM-generated outputs from 9 different models
+- 1,000 emergency department cases (500 real, 500 synthetic) each presented in 32 sociodemographic variations
+- LGBTQIA+ subgroups received mental health assessment recommendations approximately 6-7 times more often than clinically indicated
+- High-income cases received significantly more CT/MRI recommendations (P < 0.001) compared to low/middle-income cases
+- Published in Nature Medicine 2025, PubMed ID 40195448
+- Bias found in both proprietary and open-source models
+- Study covered by Inside Precision Medicine, UCSF Coordinating Center for Diagnostic Excellence, Conexiant
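
Reviewer note: the counterfactual audit protocol summarized in Key Facts (each case re-presented under multiple sociodemographic framings, recommendation rates compared across framings) can be sketched as below. This is a minimal illustration, not the study's actual code: `query_model`, the variant names, and the recommendation labels are all hypothetical stand-ins, and the stub is hard-coded to mimic the kind of bias the study measured.

```python
from collections import Counter
from itertools import product

# Hypothetical framings; the study used 32 sociodemographic variations
# of 1,000 ED cases across 9 LLMs (1.7M outputs total).
VARIANTS = ["control", "high_income", "low_income", "lgbtqia"]

def query_model(case: str, variant: str) -> set[str]:
    """Stand-in for a real LLM call, hard-coded to mimic the measured biases."""
    recs = {"ct_scan"} if variant == "high_income" else {"xray"}
    if variant == "lgbtqia":
        recs.add("mental_health_assessment")
    return recs

def audit(cases: list[str]) -> dict[str, Counter]:
    """Tally how often each recommendation appears under each framing."""
    counts: dict[str, Counter] = {v: Counter() for v in VARIANTS}
    for case, variant in product(cases, VARIANTS):
        counts[variant].update(query_model(case, variant))
    return counts

counts = audit([f"case_{i}" for i in range(1000)])

def rate(variant: str, rec: str, n: int = 1000) -> float:
    """Per-case recommendation rate under one framing."""
    return counts[variant][rec] / n

# Framing-driven divergence shows up as rate ratios far from 1.0 between
# variants of the same cases, e.g. the study's ~6-7x elevation in
# mental-health referrals under LGBTQIA+ framing.
```

The design point is that identical clinical content is held fixed while only the demographic framing varies, so any rate difference is attributable to the framing rather than to case mix; the study then tested such differences against clinical guidelines (e.g. P < 0.001 for income-stratified imaging).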