Collaborative artificial intelligence assistance was associated with a nonsignificant 8% reduction in overall chest x-ray report writing time in a small retrospective reader study, with larger time savings for reports that took longer to write without assistance and a slowdown for one of the three radiologists.
In the proof-of-concept study, three radiologists interpreted 50 chest x-rays from the MIMIC-CXR data set twice: once with artificial intelligence assistance and once without it. Cases were randomized and counterbalanced across multiple reading sessions, although no formal washout period was used.
During unassisted reads, radiologists wrote free-text reports without structured templates or autocomplete tools. For assisted reads, the researchers used RaDialog, a large vision-language model designed for chest x-ray reporting, which generated partial report suggestions in real time based on image tokens, structured findings, and the radiologist’s current draft. Radiologists could accept suggestions, modify them, or continue typing to reject them.
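The article does not describe the interface in code. The sketch below is a hypothetical illustration of how such an accept/modify/reject loop could be wired; all names (generate_suggestion, assisted_reporting_loop) are invented here, and a canned stub stands in for the model call, which in the real system is conditioned on image tokens, structured findings, and the current draft.

```python
def generate_suggestion(image_tokens, structured_findings, draft: str) -> str:
    # Placeholder for a RaDialog-style vision-language model call;
    # returns canned text so the sketch runs end to end.
    return " No acute cardiopulmonary abnormality."

def assisted_reporting_loop(image_tokens, structured_findings) -> str:
    draft = ""
    while True:
        suggestion = generate_suggestion(image_tokens, structured_findings, draft)
        print(f"Suggestion: {suggestion.strip()!r}")
        action = input("[a]ccept, [e]dit, Enter to finish, or type to reject: ")
        if action == "a":
            draft += suggestion             # accept as-is
        elif action == "e":
            draft += " " + input("Edit: ")  # accept with modification
        elif action == "":
            return draft                    # finish the report
        else:
            draft += " " + action           # keep typing; suggestion is discarded
```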
Mean writing time, measured from first to last keystroke, was 105 seconds with AI assistance compared with 114 seconds without assistance. Results varied substantially by radiologist: writing time decreased by 27% for one radiologist and by 11% for another, and increased by 9% for the third. Suggestion acceptance rates ranged from 41% to 68%.
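For reference, the headline figure follows directly from the reported means (per-reader times are not broken out here):

```python
with_ai, without_ai = 105, 114  # mean writing time in seconds, first to last keystroke
print(f"{(without_ai - with_ai) / without_ai:.1%}")  # 7.9%, reported as a nonsignificant 8% reduction
```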
In an exploratory analysis stratified by unassisted writing time, AI assistance was associated with an 18% reduction in writing time for reports with longer baseline writing times but a 13% increase for shorter baseline reports. The researchers noted that efficiency gains may depend on case complexity, reporting style, and adoption of AI-generated suggestions.
Automated report-quality metrics were broadly similar across reporting conditions. BLEU-1, METEOR, and ROUGE-L scores were slightly higher with AI assistance. CheXbert F1, a classification-based metric comparing major radiological findings, was 0.37 with AI assistance compared with 0.40 without it, numerically favoring the no-AI condition. Mean report length was similar, at 278 characters with AI assistance and 275 characters without it.
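The study’s evaluation pipeline is not reproduced in the article. The sketch below shows one common way to compute the named n-gram metrics with standard Python libraries (nltk for BLEU-1 and METEOR, rouge-score for ROUGE-L); the example reports are invented, and the CheXbert F1 step is omitted because it requires the separate CheXbert labeler.

```python
# pip install nltk rouge-score  -- illustrative only, not the study's own pipeline
import nltk
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)   # METEOR matches synonyms via WordNet
nltk.download("omw-1.4", quiet=True)

reference = "no acute cardiopulmonary abnormality".split()   # invented example
candidate = "no acute cardiopulmonary process".split()

# BLEU-1: unigram precision, so all weight goes to 1-grams.
bleu1 = sentence_bleu([reference], candidate, weights=(1.0, 0, 0, 0))

# METEOR: unigram matching with stemming and synonyms
# (NLTK >= 3.6.6 expects pre-tokenized input).
meteor = meteor_score([reference], candidate)

# ROUGE-L: longest-common-subsequence F-measure (takes strings, not tokens).
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(" ".join(reference), " ".join(candidate))["rougeL"].fmeasure

print(f"BLEU-1={bleu1:.2f}  METEOR={meteor:.2f}  ROUGE-L={rouge_l:.2f}")
```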
Radiologists rated the tool favorably on a 5-point Likert scale, with mean scores of 4.3 for ease of use and perceived writing-speed improvement and 4.0 for desire to use the tool regularly. Ratings were low for the negatively phrased items on AI suggestion errors and system cumbersomeness, which likewise indicates favorable perceptions.
The findings should be interpreted cautiously. The study included only 50 chest x-rays and three radiologists with 3 to 8 years of experience, evaluated a single model, and used MIMIC-CXR cases rather than a prospective clinical workflow. The unassisted comparator also did not include structured templates, phrase libraries, speech recognition, or other reporting tools commonly used in practice, meaning absolute time savings could differ in template-rich clinical environments.
Report quality was assessed with automated comparisons against reference reports rather than independent radiologist review. A further limitation is relevant to clinical deployment: an ablation analysis found that AI suggestions were guided by image input but not fully constrained by it, meaning language patterns and the radiologist’s existing report text could independently influence generated content. The researchers noted that in low-quality or ambiguous studies, such as portable ICU radiographs, this reliance on language priors may increase the risk of generating plausible but insufficiently image-grounded text.
Generalizability across experience levels also remains uncertain. All three participating radiologists had between 3 and 8 years of experience, and the board-certified radiologist showed the largest reduction in writing time and one of the highest suggestion-acceptance rates. The researchers noted that more senior radiologists with established reporting workflows may adopt AI assistance differently.
“Future studies should expand the sample size, include more radiologists to analyze experience-related adoption and usage patterns and compare multiple models to validate these findings,” wrote lead researcher Chantal Pellegrini, of the Technical University of Munich and the Munich Center of Machine Learning, and colleagues.
The researchers reported no competing interests.
Source: Insights into Imaging