- Large language models performed well at final diagnosis but poorly at differential diagnosis.
- Early-stage clinical reasoning remains a major weakness.
- Overall accuracy masks critical reasoning gaps.
- Gains from multimodal (imaging) input were limited and inconsistent.
- Large language models are not ready for unsupervised clinical use.

Source: JAMA Network Open (Original Investigation and Invited Commentary)