- Radiologists receiving chain-of-thought explanations achieved the highest diagnostic accuracy, outperforming the control group by 12.2 percentage points.
- Standard-output and differential-diagnosis formats did not significantly improve physician performance compared with no large language model support in the primary analysis.
- Differential-diagnosis outputs were associated with higher adherence to incorrect model recommendations, suggesting a potential risk of automation bias.
- On its own, GPT-4 achieved 80% diagnostic accuracy with chain-of-thought prompting, 75% with standard prompting, and 65% top-1 accuracy with differential-diagnosis prompting.
- The study was conducted in a controlled radiology vignette setting and did not evaluate patient outcomes, real-world workflow integration, or long-term clinical use.