In this cross-sectional study, 21 large language models were evaluated across 29 standardized clinical vignettes. Models achieved relatively high accuracy in final diagnosis and management but performed consistently poorly on differential diagnosis and diagnostic testing, with failure rates exceeding 80% on these early-stage reasoning tasks. These findings suggest that, despite incremental gains in reasoning-optimized models, current large language models cannot reliably navigate clinical uncertainty and should remain limited to supervised, adjunctive roles rather than independent use in patient-facing decision-making.
Source: JAMA Network Open Original Investigation, Invited Commentary