teleo-codex/inbox/null-result/2026-03-01-ai-degrades-human-performance-high-stakes.md

---
type: source
title: How AI Can Degrade Human Performance in High-Stakes Settings
author: AI Frontiers
url: https://ai-frontiers.org/articles/how-ai-can-degrade-human-performance-in-high-stakes-settings
date: 2026-03-01
domain: ai-alignment
secondary_domains:
  - health
format: essay
status: null-result
priority: high
triage_tag: claim
tags:
  - human-ai-performance
  - high-stakes
  - degradation
  - nursing
  - aviation
  - nuclear
  - joint-activity-testing
flagged_for_vida: "450 nursing students/nurses tested with AI in ICU cases — performance degrades 96-120% when AI predictions mislead"
processed_by: theseus
processed_date: 2026-03-18
extraction_model: anthropic/claude-sonnet-4.5
extraction_notes: LLM returned 3 claims, 3 rejected by validator
---

## Content

Cross-domain analysis of how AI degrades human performance in critical settings:

Healthcare (nursing study):

  • 450 nursing students and licensed nurses reviewing ICU cases
  • Four AI configurations from no assistance to full predictions + annotations
  • Best case: 53-67% BETTER when AI predictions accurate
  • Worst case: 96-120% WORSE when AI predictions misleading
  • "Nurses did not reliably recognize when AI predictions were right or wrong"
  • AI appeared to change HOW nurses think when assessing patients, not just what they decide

Aviation:

  • AI weather monitoring missed microbursts during landing
  • Crews faced doubled workload with halved preparation time
  • Required emergency maneuvers

Nuclear energy:

  • AI warning systems hid underlying problems through filtering
  • Misclassified gradual coolant pressure drops as benign
  • Led to cascading subsystem failures

Asymmetric risk profile:

  • Gains from accurate AI: 53-67%
  • Losses from inaccurate AI: 96-120%
  • "Averaging results can hide rare but severe errors, creating blind spots with potentially catastrophic consequences"
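The asymmetry can be made concrete with a small expected-value sketch. The gain/loss figures below use midpoints of the reported ranges; the accuracy rates are illustrative assumptions, not figures from the study:

```python
# Expected-value sketch of the asymmetric risk profile.
# gain/loss are midpoints of the reported ranges (53-67% better,
# 96-120% worse); accuracy rates are assumed for illustration.

def expected_change(accuracy, gain=0.60, loss=1.08):
    """Average performance change when the AI is right with prob `accuracy`."""
    return accuracy * gain - (1 - accuracy) * loss

print(round(expected_change(0.90), 3))  # 0.9*0.60 - 0.1*1.08 = 0.432
print(round(expected_change(0.60), 3))  # 0.6*0.60 - 0.4*1.08 = -0.072
```

Each individual failure (~1.08) costs nearly double an individual gain (~0.60), which is why a healthy-looking average can mask severe tail outcomes.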

Conditions worsening degradation:

  1. AI errors are subtle and plausible (not obviously wrong)
  2. Humans cannot verify predictions (complexity/information asymmetry)
  3. AI aggregates/filters information, hiding important signals
  4. Staffing reduced based on false confidence in AI
  5. Rare but critical failures that testing didn't anticipate

Proposed mitigation — Joint Activity Testing (JAT):

  1. Test humans AND AI together, not separately
  2. Evaluate diverse AI performance scenarios (excel, struggle, fail)
  3. Enable human error recovery over patching
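The three JAT principles above suggest an evaluation harness that tracks human+AI trials jointly and flags gaps in AI-behavior coverage. A minimal sketch, with hypothetical scenario names and data shape (not from the article):

```python
# Hypothetical Joint Activity Testing (JAT) coverage check: every
# human+AI trial records which AI behavior regime it exercised, and
# a trial plan is incomplete until all regimes appear.

AI_REGIMES = ("excels", "struggles", "fails")

def untested_regimes(trials):
    """Return the AI behavior regimes a trial plan never exercises."""
    seen = {t["ai_regime"] for t in trials}
    return [r for r in AI_REGIMES if r not in seen]

trials = [
    {"operator": "nurse", "ai_regime": "excels"},
    {"operator": "nurse", "ai_regime": "fails"},
]
print(untested_regimes(trials))  # -> ['struggles']
```

Testing the human and the AI together across all three regimes is the point: a plan that only samples the "excels" regime is exactly the kind of evaluation that averaging-based benchmarks produce.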

## Agent Notes

Triage: [CLAIM] — "AI degrades human decision-making performance asymmetrically — gains from accurate AI (53-67%) are smaller than losses from inaccurate AI (96-120%) — creating a structural risk where average performance masks catastrophic tail outcomes" — multi-domain evidence

Why this matters: The ASYMMETRY is the critical finding. Per-incident losses from misleading AI (96-120%) are roughly double per-incident gains from accurate AI (53-67%), so even an AI that is right most of the time carries outsized downside from its rare failures. This is why averaging performance hides the real risk. For alignment: human oversight of AI is not just "sometimes unhelpful" — it is structurally asymmetric, with a large downside when oversight fails and a modest upside when it succeeds.

What surprised me: The COGNITIVE CHANGE mechanism. AI doesn't just provide wrong answers — it changes how humans THINK about problems. This is deeper than automation bias; it is cognitive restructuring. Once you have internalized AI-mediated reasoning, you can't simply "turn it off" when the AI fails.

KB connections: human-in-the-loop clinical AI degrades to worse-than-AI-alone; AI capability and reliability are independent dimensions; scalable oversight degrades rapidly as capability gaps grow

Extraction hints: Three distinct claims: (1) asymmetric risk profile, (2) cognitive restructuring mechanism, (3) JAT as evaluation framework. The asymmetry finding is the most novel.

## Curator Notes

PRIMARY CONNECTION: human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs

WHY ARCHIVED: Extends our existing clinical AI degradation claim with cross-domain evidence (nursing, aviation, nuclear) and quantifies the asymmetric risk profile. The cognitive restructuring mechanism is a novel finding.

## Key Facts

  • 450 nursing students and licensed nurses participated in ICU case review study with four AI configurations
  • AI weather monitoring in aviation missed microbursts during landing, doubling crew workload and halving preparation time
  • Nuclear energy AI warning systems misclassified gradual coolant pressure drops as benign, leading to cascading subsystem failures
  • Study tested four AI configurations: no assistance, predictions only, predictions plus annotations, and full AI support