Clinical Scorecard: AI Falls Short on Differential Dx
At a Glance
| Category | Detail |
|---|---|
| Condition | Clinical diagnostic reasoning by large language models (LLMs) |
| Key Mechanisms | Evaluation of LLMs across differential diagnosis, diagnostic testing, final diagnosis, management, and clinical reasoning tasks using the PrIME-LLM metric |
| Target Population | Clinical scenarios represented by standardized vignettes from the MSD Manual |
| Care Setting | Clinical decision-making environments where AI tools might assist diagnosis and management |
Key Highlights
- LLMs achieved high accuracy on final diagnosis tasks (81% to 90%) but performed poorly on differential diagnosis, with failure rates above 80%.
- Reasoning-optimized models outperformed nonreasoning models overall, but all models struggled to maintain and refine differential diagnoses.
- Multimodal image-capable models showed mixed improvements; text-only performance was more stable.
Guideline-Based Recommendations
Diagnosis
- Current LLMs should not be relied upon to generate comprehensive differential diagnoses because of their high failure rates on this task.
- Physicians must maintain primary responsibility for diagnostic reasoning and decision-making.
Management
- LLMs may assist with management tasks but require careful supervision and validation by clinicians.
Monitoring & Follow-up
- Evaluate AI tools on an ongoing basis using metrics that assess the full clinical workflow, including reasoning processes.
Risks
- Premature convergence on single diagnoses by LLMs can lead to missed alternative diagnoses.
- Variability and hallucinations in LLM outputs pose risks for clinical deployment without oversight.
Patient & Prescribing Data
The study population consisted of simulated patients represented by standardized clinical vignettes.
LLMs showed intermediate accuracy on management tasks but did not demonstrate the advanced clinical reasoning needed for safe autonomous use.
Clinical Best Practices
- Use LLMs as adjunct tools under direct physician supervision rather than autonomous decision-makers.
- Evaluate AI model outputs critically, especially differential diagnoses, to avoid premature diagnostic closure.
- Incorporate evaluation frameworks like PrIME-LLM that assess reasoning across the clinical workflow.
- Remain cautious of variability and hallucinations inherent in current LLM architectures.
- Prioritize physician judgment and clinical expertise over AI-generated conclusions.
References
- JAMA Network Open Original Investigation
- Invited Commentary by Tordjman and Mei, Icahn School of Medicine
This content is an AI-generated, fully rewritten summary based on a published scholarly article. It does not reproduce the original text and is not a substitute for the original publication. Readers are encouraged to consult the source for full context, data, and methodology.