Artificial intelligence systems can assign computed tomography and magnetic resonance imaging examination protocols with an overall accuracy of about 85%, performing similarly across several types of models, according to a systematic review and meta-analysis of 23 studies involving nearly 1.2 million imaging orders.
Across the analysis, pooled protocoling accuracy was 83% for traditional machine learning models, 87% for transformer models built on bidirectional encoder representations from transformers (BERT), and 86% for large language models. The differences between approaches were not statistically significant.
For the study published in the American Journal of Roentgenology, investigators led by Ethan Sacoransky, MD, of the University of Saskatchewan in Canada reviewed studies published between 2017 and 2025 evaluating artificial intelligence tools designed to automate assignment of computed tomography or magnetic resonance imaging protocols using information from imaging requisitions.
The researchers searched multiple databases through July 2025 and included English-language studies reporting quantitative performance measures such as accuracy, precision, recall, or F1 score.
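As a point of reference, the performance measures reported across these studies relate to one another in a standard way. The sketch below computes them from scratch; the protocol labels and predictions are hypothetical illustrations, not data from the review.

```python
def protocol_metrics(y_true, y_pred):
    """Accuracy plus per-class precision, recall, and F1 for protocol labels."""
    labels = sorted(set(y_true) | set(y_pred))
    # Accuracy: fraction of requisitions assigned the correct protocol.
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = {}
    for c in labels:
        tp = sum(t == p == c for t, p in zip(y_true, y_pred))   # true positives
        fp = sum(p == c and t != c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[c] = (precision, recall, f1)
    return acc, per_class

# Hypothetical ground-truth protocols vs. model predictions
truth = ["CT head", "CT head", "MRI brain", "MRI brain", "CT chest"]
pred  = ["CT head", "MRI brain", "MRI brain", "MRI brain", "CT chest"]
acc, per_class = protocol_metrics(truth, pred)
```

Per-class scores matter here because, as the review notes, overall accuracy can mask poor performance on rare protocol categories.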
Across the 23 studies, investigators analyzed 57 performance results representing 30 distinct models. Training datasets ranged from 1,235 to 559,305 cases, with a median of about 61,600 cases. Test datasets ranged from 100 to nearly 140,000 cases.
Six studies evaluated computed tomography requisitions, 10 evaluated magnetic resonance imaging requisitions, and seven included both imaging modalities.
Traditional machine learning models—including random forests, support vector machines, gradient-boosting machines, and deep neural networks—were evaluated in 16 studies and achieved accuracies ranging from 65% to 95%.
Eight studies evaluated BERT-based transformer models, all of which incorporated task-specific fine-tuning using imaging requisition text. These models achieved accuracies ranging from 78% to 93%; the highest-performing individual model, BioBERT, reached 93%. Among the 10 best-performing models identified across studies, six were BERT-based.
Large language models were evaluated in five studies, including GPT-3.5 Turbo, GPT-4, GPT-4o, and reasoning-focused variants. Accuracy ranged from 78% to 92%, with a pooled accuracy of 86%. Only one study used task-specific fine-tuning, which the investigators suggested may partly explain why transformer-based BERT models performed slightly better overall.
Subgroup analyses showed similar performance across imaging modalities: accuracy was 83% for computed tomography requisitions and 85% for magnetic resonance imaging requisitions. Accuracy was also somewhat higher for English-language than for non-English requisitions.
Investigators identified several common sources of protocoling errors. Ambiguous or incomplete requisition text—such as vague symptoms or missing anatomic details—often led to incorrect protocol selection. Data imbalance also played a role, as models trained on datasets dominated by common protocols performed poorly when predicting rare protocol categories.
In addition, some artificial intelligence “errors” reflected clinically acceptable alternative protocols rather than clear mistakes, highlighting variability in radiologist decision-making.
Despite these limitations, the authors said current performance levels suggest artificial intelligence could help streamline radiology workflows if implemented cautiously. Hybrid systems in which algorithms automatically assign routine protocols while uncertain cases are referred to radiologists for review may offer the most practical near-term approach.
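A hybrid workflow of this kind is commonly built around a confidence threshold: the model's protocol assignment is accepted automatically only when its top-class probability is high enough, and everything else is queued for a radiologist. The sketch below illustrates the idea; the threshold value and requisition cases are hypothetical, not drawn from the reviewed studies.

```python
def triage(predictions, threshold=0.90):
    """Route each requisition: auto-assign the model's protocol when its
    confidence clears the threshold, otherwise queue it for radiologist
    review. Threshold and cases are illustrative assumptions."""
    auto, review = [], []
    for req_id, protocol, confidence in predictions:
        if confidence >= threshold:
            auto.append((req_id, protocol))
        else:
            review.append((req_id, protocol, confidence))
    return auto, review

# Hypothetical model outputs: (requisition ID, predicted protocol, confidence)
cases = [
    ("req-001", "CT head without contrast", 0.97),
    ("req-002", "MRI brain with and without contrast", 0.62),
    ("req-003", "CT chest with contrast", 0.93),
]
auto, review = triage(cases)
```

In practice the threshold would be tuned on a validation set to balance automation rate against error rate, since, as the review found, many low-confidence cases involve ambiguous requisition text that genuinely needs human judgment.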
“The tools show strong potential to help streamline radiologist workflows, possibly through hybrid AI-radiologist approaches,” the investigators wrote.
The authors said future research should focus on prospective clinical trials and further development of fine-tuned large language models.