Radiologists who used draft reports generated by a workflow-integrated generative artificial intelligence (AI) model completed plain radiograph documentation faster in a prospective cohort study, with no statistically significant difference in peer-reviewed clinical accuracy or textual quality compared with reports completed without model assistance.
The study, published in JAMA Network Open, evaluated a relatively compact, institution-trained multimodal encoder-decoder transformer model that generated draft radiology reports from plain radiographs and clinical data in the electronic health record. The model was integrated into routine radiology reporting software at a single 12-hospital tertiary care academic health system from November 2023 to April 2024.
The documentation efficiency analysis included 23,960 radiographs from 14,460 patients: 11,980 reports completed with model assistance and a matched premodel baseline set of 11,980 reports completed without model use. The comparison set was matched by chest vs nonchest study type and radiologist representation, and only studies without trainee involvement were included.
Mean documentation time decreased from 189 seconds without model assistance to 160 seconds with model-assisted reporting, a 15.5% increase in per-study documentation efficiency. The researchers estimated that the reduction represented more than 63 hours of documentation time saved during the study period.
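As a rough check on the arithmetic, the relative reduction in mean documentation time can be recomputed from the rounded means quoted above. This is a minimal sketch, not the researchers' calculation: the variable names are illustrative, and the small gap between the result and the reported 15.5% is assumed to come from the published figure being based on unrounded means.

```python
# Rounded mean documentation times quoted in the article (seconds).
mean_without_model_s = 189
mean_with_model_s = 160

# Absolute and relative reduction in mean time per study.
seconds_saved_per_study = mean_without_model_s - mean_with_model_s
relative_reduction = seconds_saved_per_study / mean_without_model_s

print(f"Seconds saved per study: {seconds_saved_per_study}")  # 29
print(f"Relative reduction: {relative_reduction:.1%}")        # 15.3%
```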
Of the model-assisted studies, 82% were chest radiographs and 18% were nonchest radiographs. The model completed inference in a median of 3 seconds, allowing draft reports to become available within seconds of image acquisition and before radiologists began interpretation. Draft reports required more editing for nonchest studies than for chest studies, with median word error rates of 0.63 and 0.31, respectively.
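For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, insertions, and deletions) between two texts, divided by the number of words in the reference text. The sketch below uses that standard definition and assumes the radiologist's final report is the reference and the model draft is the comparison text; the article does not specify how the researchers computed the metric, and the function name and example sentences are hypothetical.

```python
def word_error_rate(reference: str, draft: str) -> float:
    """Word-level edit distance between draft and reference, per reference word."""
    ref_words = reference.split()
    draft_words = draft.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(draft_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, draft_word in enumerate(draft_words, start=1):
            cost = 0 if ref_word == draft_word else 1
            curr.append(min(prev[j] + 1,          # word missing from the draft
                            curr[j - 1] + 1,      # extra word in the draft
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref_words)


# Hypothetical example: one of four words differs.
final_report = "no acute cardiopulmonary process"
model_draft = "no acute osseous process"
print(word_error_rate(final_report, model_draft))  # 0.25
```

Under this definition, a value of 0 means the draft was left unchanged, and higher values mean more of the draft was edited relative to the length of the final report.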
In blinded peer review of 800 studies, researchers found no statistically significant difference in clinical accuracy or textual quality between reports completed with and without model assistance. However, chest radiographs received higher clinical accuracy ratings than nonchest radiographs overall, and agreement among peer reviewers was fair, limiting the ability of the review process to detect subtle differences in report quality.
Report addenda, used as an indirect marker of reporting corrections, were also similar before and after model implementation, occurring in 16 premodel reports and 17 model-assisted reports. However, addenda may not capture errors that were not identified or corrected.
The researchers also evaluated whether model-generated report text could identify unexpected pneumothorax warranting clinical team notification. This prioritization system ran in shadow mode and did not surface alerts to clinicians during the study.
Among 97,651 radiographs screened, the system identified unexpected pneumothorax cases warranting clinical team notification with 73% sensitivity and 99.9% specificity. Priority flags were available a median of 24 seconds following study completion, compared with a median of about 25 minutes for radiologist notification, though whether faster flagging would improve patient outcomes was not evaluated. The system missed 9 of 33 retrospectively eligible cases; all but 1 were described by radiologists as small, suspected, or uncertain.
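The reported sensitivity is consistent with the case counts given above: if 9 of 33 eligible cases were missed, 24 were detected. A minimal sketch of that check follows, with illustrative variable names. Specificity is not recomputed because the article does not give the underlying counts of negative studies.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of actual positive cases that the system flagged."""
    return true_positives / (true_positives + false_negatives)


eligible_cases = 33  # retrospectively eligible pneumothorax cases
missed_cases = 9     # cases the system did not flag
detected_cases = eligible_cases - missed_cases

print(f"{sensitivity(detected_cases, missed_cases):.1%}")  # 72.7%
```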
The researchers described illustrative cases in which delayed pneumothorax recognition may have affected care, including a patient discharged from the emergency department before an attending radiologist later identified a large pneumothorax and another patient whose pneumothorax was recognized following acute oxygen desaturation hours after image acquisition.
The study was limited by its observational, nonrandomized design and use of historical controls, which means workflow changes, staffing, case mix, or other secular factors could have contributed to the observed efficiency gains. The model was also evaluated within a single academic health system, which may limit generalizability to other practice settings, including community hospitals and nonacademic centers. The researchers also noted that further study is needed to assess long-term productivity, physician burnout, performance drift, and whether the observed efficiency gains are sustained over time.
“Our results provide initial evidence for benefits of draft reporting using generative AI tools and a framework by which clinician-AI collaboration may effectively integrate into and improve existing clinical workflows,” wrote Jonathan Huang, PhD, of Northwestern University Feinberg School of Medicine, Chicago, and colleagues.
Disclosures: Several researchers reported patents in varying stages of application. J. Alex Heller, MS, and Mozziyar Etemadi, MD, reported equity interests in and consulting fees from Cardiosense Inc outside the submitted work. Dr Etemadi also reported patents licensed to Northwestern Medicine. No other disclosures were reported.
Source: JAMA Network Open