Frontier artificial intelligence models produced accurate final diagnoses but struggled with differential diagnosis in stepwise clinical scenarios, according to a cross-sectional study published in JAMA Network Open.
Researchers evaluated 21 off-the-shelf large language models (LLMs) using 29 standardized clinical vignettes and a newly developed composite metric, the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM), designed to assess performance across the full clinical workflow—an area not captured by traditional benchmarks.
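The article does not report the PrIME-LLM formula. As a rough illustration of what a composite workflow index could look like, here is a minimal Python sketch assuming a weighted mean of per-domain accuracies; the domain names follow the five domains in the study, but the weights and example numbers are hypothetical, not values from the paper:

```python
# Hypothetical sketch of a composite index in the spirit of PrIME-LLM.
# Assumption: the index aggregates per-domain accuracy (each in [0, 1])
# into a single 0-1 score via weights that sum to 1. The actual PrIME-LLM
# formula is not given in the article.

DOMAINS = ["differential_diagnosis", "diagnostic_testing",
           "final_diagnosis", "management", "miscellaneous"]

def composite_index(domain_accuracy: dict[str, float],
                    weights: dict[str, float] | None = None) -> float:
    """Weighted mean of per-domain accuracies."""
    if weights is None:
        weights = {d: 1 / len(DOMAINS) for d in DOMAINS}  # equal weighting
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[d] * domain_accuracy[d] for d in DOMAINS)

# Illustrative per-domain numbers only (the article reports no per-model
# domain breakdown); chosen to match the qualitative pattern described.
example = {"differential_diagnosis": 0.18, "diagnostic_testing": 0.72,
           "final_diagnosis": 0.88, "management": 0.75, "miscellaneous": 0.70}
print(round(composite_index(example), 2))  # 0.65, within the reported 0.64-0.78 range
```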
The models—including GPT-5, Claude 4.5 Opus, Gemini 3.0 Flash and Pro, and Grok 4—were tested sequentially across five domains: differential diagnosis, diagnostic testing, final diagnosis, management, and miscellaneous clinical reasoning tasks.
PrIME-LLM scores ranged from 0.64 for Gemini 1.5 Flash to 0.78 for Grok 4, with reasoning-optimized models outperforming nonreasoning models overall. Traditional accuracy measures were high and tightly clustered—between about 81% and 90%—but the PrIME-LLM framework revealed wider differences in performance across reasoning tasks.
Differential diagnosis was consistently the weakest domain, with failure rates exceeding 80% across models, compared with less than 40% for final diagnosis tasks; diagnostic testing, management, and other reasoning tasks fell between these extremes.
Item-level analyses showed that across nearly all models, accuracy on final diagnosis tasks exceeded accuracy on both diagnostic testing and differential diagnosis tasks, with margins of roughly 12% to 16% over diagnostic testing; diagnostic testing in turn outperformed differential diagnosis.
The findings highlight a key limitation in how models process clinical uncertainty. While clinicians typically maintain and refine a differential diagnosis over time, LLMs tended to converge prematurely on a single answer, the researchers noted.
Multimodal performance was mixed. Among 18 models with image capabilities, several—including GPT-4.5, Gemini 3.0 Pro, and Grok 4—showed improved accuracy on image-based questions, although gains were not consistent across all models and text-only performance remained more stable overall.
The study included triplicate evaluation of each vignette to account for variability, yielding more than 16,000 total responses. Models were tested using stepwise scenarios derived from the MSD Manual and scored by medical student evaluators using a standardized rubric. Optional features such as web search and retrieval tools were disabled to ensure consistent comparisons across models.
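The reported totals are consistent with simple arithmetic, as the quick sketch below shows; the per-vignette item count is an assumption, since the article reports only the overall total:

```python
# Back-of-the-envelope check of the response count.
models, vignettes, replicates = 21, 29, 3
vignette_runs = models * vignettes * replicates  # 1,827 stepwise runs
# Assumption: roughly 9 scored items per run (not stated in the article).
print(vignette_runs, vignette_runs * 9)  # 1827, 16443 -> "more than 16,000"
```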
The researchers noted several limitations, including the use of publicly available clinical vignettes, which may have been included in model training, and the exclusion of augmented tools such as retrieval systems or guideline integration that could improve real-world performance.
In an accompanying invited commentary, Mickael Tordjman, MD, MS, and Xueyan Mei, PhD, both of the Icahn School of Medicine at Mount Sinai in New York, said the findings raise concerns about how LLMs are evaluated and deployed in clinical settings.
“The observation … that models may arrive at the correct final diagnosis while failing to construct a coherent differential supports this concern,” they wrote.
They added that hallucinations and variability—where identical prompts can yield different responses—remain inherent limitations of current model architectures.
“Even the most advanced reasoning models struggled to generate appropriate differential diagnoses,” they wrote, noting that effective diagnostic reasoning requires weighing probabilities and uncertainty rather than retrieving isolated facts.
The commentary also emphasized that current benchmarks may overestimate clinical readiness by focusing on final answers rather than the reasoning process and called for evaluation methods that better reflect real-world clinical decision-making.
The study findings “suggest that despite version-based improvements and advantages in reasoning-optimized models, off-the-shelf LLMs have not yet achieved the intelligence required for safe deployment and remain limited in demonstrating advanced clinical reasoning,” the researchers wrote.
The commentators concluded that these tools should remain carefully supervised adjuncts, reinforcing that physicians—not artificial intelligence—remain the primary decision-makers in clinical care.
Disclosures: Adam B. Landman, MD, reported receiving consulting fees from Abbott for work with its Medical Device Cybersecurity Council outside the submitted study. No other disclosures were reported. The study was supported in part by a National Institute of General Medical Sciences training award.
Source: JAMA Network Open Original Investigation, Invited Commentary