In a study published in npj Digital Medicine, researchers evaluated 101 US radiologists who completed 2,020 diagnostic assessments across 20 radiology cases drawn from the New England Journal of Medicine Image Challenge. Cases included brief clinical vignettes paired with computed tomography or magnetic resonance imaging scans.
Participants were randomly assigned to one of four groups: no large language model (LLM) support, standard-output support, differential-diagnosis support, or chain-of-thought support. The chain-of-thought prompt instructed GPT-4 to “Diagnose the patient and explain step-by-step.”
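The article does not describe the exact API configuration used in the study, but the quoted instruction can be illustrated with a minimal sketch of how such a chain-of-thought prompt might be sent to GPT-4 through the OpenAI Python client. The case vignette, model identifier, and request settings below are placeholder assumptions, and how the accompanying CT or MRI images were supplied to the model is not specified here.

```python
# Minimal sketch of chain-of-thought prompting as quoted in the article.
# The vignette text and model identifier are illustrative assumptions;
# only the instruction wording comes from the study description.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

vignette = "72-year-old with acute flank pain; abdominal CT provided."  # hypothetical case text

response = client.chat.completions.create(
    model="gpt-4",  # assumed identifier; the study used GPT-4
    messages=[
        {
            "role": "user",
            # Case vignette followed by the chain-of-thought instruction quoted in the article.
            "content": f"{vignette}\n\nDiagnose the patient and explain step-by-step.",
        },
    ],
)

print(response.choices[0].message.content)
```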
Radiologists in the control group achieved diagnostic accuracy of approximately 56% to 60%, depending on the analysis used. Those assigned to chain-of-thought support improved diagnostic accuracy by 12 percentage points over the control group, into roughly the upper-60% range. Chain-of-thought explanations also outperformed standard-output support by 7 percentage points and differential-diagnosis support by 10 percentage points.
At the LLM level, GPT-4 achieved 75% diagnostic accuracy with standard-output prompting and 80% accuracy with chain-of-thought prompting. Differential-diagnosis prompting achieved 65% top-1 accuracy, although the correct diagnosis appeared somewhere within the top five suggestions in 80% of cases.
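For readers unfamiliar with the metrics, the difference between top-1 and top-5 accuracy can be shown with a short script: top-1 counts a case as correct only when the first-ranked suggestion matches the true diagnosis, while top-5 counts it as correct if the true diagnosis appears anywhere in the five-item differential. The function and case data below are hypothetical illustrations, not taken from the study.

```python
# Illustrative top-k accuracy for differential-diagnosis output, where the
# model returns a ranked list of candidate diagnoses per case.
from typing import List

def top_k_accuracy(ranked_lists: List[List[str]], truths: List[str], k: int) -> float:
    """Fraction of cases whose true diagnosis appears among the top k suggestions."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_lists, truths))
    return hits / len(truths)

# Hypothetical example: 4 cases, each with a ranked differential of 5 diagnoses.
differentials = [
    ["pulmonary embolism", "pneumonia", "pneumothorax", "pleural effusion", "atelectasis"],
    ["glioblastoma", "metastasis", "abscess", "lymphoma", "demyelination"],
    ["appendicitis", "diverticulitis", "colitis", "ileus", "mesenteric ischemia"],
    ["renal calculus", "pyelonephritis", "renal infarct", "cyst rupture", "tumor"],
]
ground_truth = ["pneumonia", "glioblastoma", "mesenteric ischemia", "renal calculus"]

print(top_k_accuracy(differentials, ground_truth, k=1))  # top-1 accuracy: 0.5
print(top_k_accuracy(differentials, ground_truth, k=5))  # top-5 accuracy: 1.0
```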
Researchers noted that even with this more permissive top-5 framework, differential-diagnosis support did not significantly improve physician diagnostic accuracy compared with no LLM assistance in the primary analysis.
In a secondary adjusted analysis accounting for factors including decision time and output length, radiologists receiving differential-diagnosis support performed worse than the control group, whereas chain-of-thought support continued to outperform the control condition.
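The article does not detail the adjusted model, but one plausible form of such an adjustment is a regression of diagnostic correctness on study arm with decision time and output length as covariates. The sketch below uses the statsmodels library with entirely synthetic data and hypothetical variable names; it is not the authors' analysis, and it ignores clustering of repeated assessments within radiologists.

```python
# Illustrative adjusted analysis: logistic regression of correctness on study
# arm, controlling for decision time and output length. Synthetic data only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400  # hypothetical number of case-level assessments

df = pd.DataFrame({
    "correct": rng.integers(0, 2, n),                                 # 1 = correct diagnosis
    "arm": rng.choice(["control", "standard", "differential", "cot"], n),
    "decision_time_s": rng.normal(90, 20, n),                         # seconds per case
    "output_length": rng.normal(150, 40, n),                          # words of LLM output
})

# Treat the no-LLM control arm as the reference category.
model = smf.logit(
    "correct ~ C(arm, Treatment(reference='control')) + decision_time_s + output_length",
    data=df,
).fit(disp=False)

print(model.summary())
```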
The study also evaluated how physicians responded to incorrect model recommendations. Radiologists receiving differential-diagnosis outputs were more likely to follow incorrect LLM suggestions, while those receiving chain-of-thought explanations were more likely to override incorrect recommendations and adhere to correct ones.
The researchers reported that the pattern held broadly across radiologists regardless of information technology proficiency, years of experience, and subspecialty expertise, although these subgroup analyses drew on relatively small study arms of 23 to 30 physicians each.
The study used a between-subjects design, meaning each radiologist participated in only one study condition. Researchers acknowledged that although the sample substantially exceeded those of similar prior studies, unmeasured differences between groups could not be fully excluded.
The researchers also cautioned that the findings came from a controlled vignette setting rather than routine clinical practice. The study did not evaluate patient outcomes, longitudinal use, workflow integration, or harms associated with incorrect diagnoses.
The investigators further noted that chain-of-thought prompting modestly improved the LLM’s own diagnostic performance, making it difficult to completely separate explanation format from underlying model accuracy. However, robustness analyses controlling for LLM diagnostic accuracy continued to favor chain-of-thought explanations.
“Chain-of-thought explanations improve physician performance by providing transparent reasoning that enables critical evaluation of LLM advice,” the researchers wrote.
Disclosure: The researchers reported no competing interests.
Source: npj Digital Medicine