AI Falls Short on Differential Dx

New PrIME-LLM benchmark shows strong diagnostic accuracy but persistent gaps in clinical reasoning across 21 large language models

Edited by Kathryn Wighton

Conexiant April 13, 2026

AI Models Show High Final Diagnosis Accuracy but Poor Differential Diagnosis

Overview

A cross-sectional study evaluating 21 large language models (LLMs) found that while these models achieve high accuracy in final diagnoses, they consistently underperform in generating differential diagnoses. The newly developed PrIME-LLM metric revealed significant variability in reasoning tasks, highlighting a critical limitation in current AI clinical reasoning capabilities.

Background

Large language models are increasingly explored for clinical decision support, but traditional benchmarks focus mainly on final diagnosis accuracy, overlooking the stepwise reasoning process essential in clinical practice. Differential diagnosis involves maintaining and refining multiple possible conditions, a complex task that reflects clinical uncertainty. This study assessed LLMs across the full clinical workflow using standardized vignettes and a novel composite metric, PrIME-LLM, to better capture performance in differential diagnosis, diagnostic testing, management, and other reasoning tasks.

Data Highlights

Model	PrIME-LLM Score	Differential Diagnosis Failure Rate	Final Diagnosis Failure Rate
Grok 4	0.78	>80%	<40%
Gemini 1.5 Flash	0.64	>80%	<40%

Across models, traditional accuracy measures ranged from approximately 81% to 90%. Failure rates exceeded 80% on differential diagnosis tasks, compared with less than 40% on final diagnosis tasks.

Key Findings

LLMs achieved high final diagnosis accuracy (81%-90%) but showed poor performance in differential diagnosis, with failure rates exceeding 80%.
The PrIME-LLM metric revealed wider performance variability across reasoning tasks than traditional accuracy metrics.
Diagnostic testing accuracy was intermediate, outperforming differential diagnosis but lagging behind final diagnosis.
Reasoning-optimized models outperformed nonreasoning models overall, with Grok 4 scoring highest on PrIME-LLM.
LLMs tended to prematurely converge on a single diagnosis rather than maintaining a differential, limiting their clinical reasoning fidelity.
Multimodal image-capable models showed some accuracy improvements on image-based questions, but text-only performance remained more consistent.

Clinical Implications

These findings underscore the current limitations of off-the-shelf LLMs in replicating the nuanced clinical reasoning process, particularly in generating and refining differential diagnoses. Clinicians should exercise caution when integrating AI tools into diagnostic workflows, recognizing that these models may provide accurate final answers but lack the reasoning transparency and uncertainty management essential for safe clinical decision-making.

Conclusion

Despite advances and reasoning optimizations, current LLMs fall short in differential diagnosis and comprehensive clinical reasoning, indicating they remain adjunct tools requiring physician oversight rather than autonomous decision-makers.

References

JAMA Network Open Original Investigation -- AI Falls Short on Differential Dx
Tordjman M, Mei X. Invited Commentary, Icahn School of Medicine at Mount Sinai
