AI Models Show High Final Diagnosis Accuracy but Poor Differential Diagnosis
Overview
A cross-sectional study evaluating 21 large language models (LLMs) found that while these models achieve high accuracy in final diagnoses, they consistently underperform in generating differential diagnoses. The newly developed PrIME-LLM metric revealed significant variability in reasoning tasks, highlighting a critical limitation in current AI clinical reasoning capabilities.
Background
Large language models are increasingly explored for clinical decision support, but traditional benchmarks focus mainly on final diagnosis accuracy, overlooking the stepwise reasoning process essential in clinical practice. Differential diagnosis involves maintaining and refining multiple possible conditions, a complex task that reflects clinical uncertainty. This study assessed LLMs across the full clinical workflow using standardized vignettes and a novel composite metric, PrIME-LLM, to better capture performance in differential diagnosis, diagnostic testing, management, and other reasoning tasks.
Data Highlights
| Model | PrIME-LLM Score | Differential Diagnosis Failure Rate | Final Diagnosis Failure Rate |
|---|---|---|---|
| Grok 4 | 0.78 | >80% | <40% |
| Gemini 1.5 Flash | 0.64 | >80% | <40% |
Traditional accuracy measures ranged from approximately 81% to 90% across models, but differential diagnosis failure rates exceeded 80%, contrasting with less than 40% failure in final diagnosis tasks.
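To make the contrast between stage-level failure rates and a single pooled figure concrete, here is a minimal Python sketch. The stage names, the pass/fail grading, and the sample data are all assumptions invented for illustration. It does not reproduce the study's scoring procedure, and it does not implement PrIME-LLM, whose definition is not given in this summary.

```python
from collections import defaultdict

# Hypothetical graded responses as (stage, passed) pairs.
# Invented for illustration only; these are NOT data from the study.
graded = [
    ("differential", False), ("differential", False), ("differential", False),
    ("differential", False), ("differential", True),
    ("testing", True), ("testing", False), ("testing", True), ("testing", True),
    ("final", True), ("final", True), ("final", True), ("final", False),
]


def failure_rates(results):
    """Return the fraction of failed responses for each stage."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for stage, passed in results:
        totals[stage] += 1
        if not passed:
            failures[stage] += 1
    return {stage: failures[stage] / totals[stage] for stage in totals}


rates = failure_rates(graded)

# A single pooled failure rate across all stages hides the per-stage spread.
pooled_failure = sum(not passed for _, passed in graded) / len(graded)

for stage, rate in rates.items():
    print(f"{stage}: {rate:.0%} failed")
print(f"pooled: {pooled_failure:.0%} failed")
```

In this toy data the differential stage fails far more often than the final stage, yet the pooled figure sits between the two and reveals neither extreme. That is the general reason for reporting each stage of the workflow separately.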
Key Findings
- Traditional accuracy measures were high (approximately 81% to 90%) and final diagnosis failure rates stayed below 40%, but differential diagnosis was poor, with failure rates exceeding 80%.
- The PrIME-LLM metric revealed wider performance variability across reasoning tasks than traditional accuracy metrics.
- Performance on diagnostic testing was intermediate: better than on differential diagnosis but worse than on final diagnosis.
- Reasoning-optimized models outperformed nonreasoning models overall, with Grok 4 scoring highest on PrIME-LLM.
- LLMs tended to prematurely converge on a single diagnosis rather than maintaining a differential, limiting their clinical reasoning fidelity.
- Multimodal image-capable models showed some accuracy improvements on image-based questions, but text-only performance remained more consistent.
Clinical Implications
These findings underscore the current limitations of off-the-shelf LLMs in replicating the nuanced clinical reasoning process, particularly in generating and refining differential diagnoses. Clinicians should exercise caution when integrating AI tools into diagnostic workflows, recognizing that these models may provide accurate final answers but lack the reasoning transparency and uncertainty management essential for safe clinical decision-making.
Conclusion
Despite advances and reasoning optimizations, current LLMs fall short in differential diagnosis and comprehensive clinical reasoning, indicating they remain adjunct tools requiring physician oversight rather than autonomous decision-makers.
References
- JAMA Network Open Original Investigation -- AI Falls Short on Differential Dx
- Tordjman M, Mei X. Invited Commentary, Icahn School of Medicine at Mount Sinai
This content is an AI-generated, fully rewritten summary based on a published scholarly article. It does not reproduce the original text and is not a substitute for the original publication. Readers are encouraged to consult the source for full context, data, and methodology.