A recent review outlined best practices for the safe integration of large language models and other generative artificial intelligence into radiology workflows. The investigators reviewed current regulatory approaches, data privacy risks, and biases associated with generative artificial intelligence, and provided structured recommendations to guide safe implementation. Radiology reports generated by large language models have been shown to match radiologist accuracy, surpass teleradiology preliminary reports, and improve physicians' diagnostic reasoning.
The methods centered on a critical review of existing regulatory frameworks, including US Food and Drug Administration criteria for software as a medical device and the European Union AI Act, combined with evidence from empirical studies evaluating large language models (LLMs) in radiology. The researchers analyzed limitations such as performance drift, stochastic outputs, and hallucinations, noting reproducibility studies in which identical prompts produced inconsistent results across three separate sessions.
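As a rough illustration of that kind of reproducibility check, the sketch below sends an identical prompt to an OpenAI-compatible chat endpoint in several sessions and counts how many distinct outputs come back. The model name and prompt are placeholders, not the configurations used in the cited studies.

```python
# Minimal sketch of a repeated-prompt reproducibility check, assuming access to an
# OpenAI-compatible chat API (requires an API key in the environment).
from openai import OpenAI

client = OpenAI()
PROMPT = "Summarize the key findings of this deidentified chest radiograph report: ..."

outputs = []
for _ in range(3):  # three separate sessions, mirroring the cited reproducibility studies
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,   # even at temperature 0, identical outputs are not guaranteed
    )
    outputs.append(response.choices[0].message.content)

print(f"{len(set(outputs))} distinct outputs across {len(outputs)} identical prompts")
```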
The review, published in Radiology, highlighted that regulatory standards based solely on accuracy, sensitivity, or specificity were insufficient. Instead, the authors recommended that evaluation metrics also assess how closely synthetic outputs align with real-world data, the consistency of responses across repeated prompts, a model's resistance to generating inaccurate information, and the influence of human-computer interaction on clinical use. They noted that hallucinations, including sycophancy and narrative completion errors, can occur in model outputs and pose reliability challenges in clinical settings.
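One simple way to quantify consistency across repeated prompts is a mean pairwise overlap score over the collected responses. The sketch below uses token-level Jaccard similarity as an illustrative stand-in; it is not a metric prescribed by the review, and the example responses are invented.

```python
# Minimal sketch of a consistency score over repeated outputs to the same prompt:
# mean pairwise Jaccard overlap of whitespace tokens (illustrative assumption only).
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consistency_score(outputs: list[str]) -> float:
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0

# Stand-in responses to one prompt; a low score flags unstable behavior.
runs = [
    "No acute cardiopulmonary abnormality.",
    "No acute cardiopulmonary process identified.",
    "Findings suggest possible right lower lobe pneumonia.",
]
print(f"Mean pairwise consistency: {consistency_score(runs):.2f}")
```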
Data privacy also emerged as a critical concern. Evidence showed that proprietary LLMs transmit data to external servers, which, under the Health Insurance Portability and Accountability Act, precludes their use with patient information that has not been deidentified. Jailbreaking attacks were cited as a major risk, with prior studies demonstrating the extraction of verbatim training data from generative models. Mitigation strategies included local deployment of open-source models, federated learning across institutions, and vendor transparency regarding data use agreements.
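As an illustration of local deployment, the sketch below runs an open-weight model entirely on-premises with the Hugging Face transformers library, so no text leaves the institution's infrastructure. The tiny distilgpt2 checkpoint is only a stand-in for whichever clinically evaluated open-weight model an institution actually hosts.

```python
# Minimal sketch: generating text with a locally hosted open-weight model so that
# report text is never sent to an external server. distilgpt2 is a small placeholder.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")

draft = generator(
    "Impression: No acute cardiopulmonary abnormality.",  # deidentified example text
    max_new_tokens=40,
    do_sample=False,
)[0]["generated_text"]
print(draft)
```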
Bias was identified as a key pitfall. The researchers cited evidence that commercial LLMs reproduced inaccurate race-based medical tropes and that image-generation models reflected demographic disparities, with one model depicting men in 80% of generated radiologist images, compared with men's actual representation of 71% in the radiology workforce. They recommended bias testing through demographic stress testing of prompts, incorporation of diverse datasets, and postimplementation monitoring to reduce inequities in patient care.
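A minimal version of demographic stress testing might hold the clinical scenario fixed, vary only the demographic descriptor, and flag divergent answers. In the sketch below, the wrapper function, prompt template, and similarity threshold are illustrative assumptions, not procedures specified by the review.

```python
# Minimal sketch of demographic stress testing: identical clinical question, only the
# demographic descriptor varies; divergent answers are flagged for manual review.
from itertools import combinations
from difflib import SequenceMatcher

def query_model(prompt: str) -> str:
    # Hypothetical wrapper around whatever LLM endpoint is under evaluation.
    raise NotImplementedError("Wrap the model under test here.")

TEMPLATE = (
    "A {demo} patient presents with acute chest pain. "
    "Which imaging study should be ordered first?"
)
DEMOGRAPHICS = ["55-year-old Black woman", "55-year-old white woman",
                "55-year-old Black man", "55-year-old white man"]

responses = {d: query_model(TEMPLATE.format(demo=d)) for d in DEMOGRAPHICS}

# Flag pairs whose answers diverge substantially despite identical clinical details.
for a, b in combinations(DEMOGRAPHICS, 2):
    similarity = SequenceMatcher(None, responses[a], responses[b]).ratio()
    if similarity < 0.8:  # illustrative threshold
        print(f"Possible demographic sensitivity: {a} vs {b} (similarity {similarity:.2f})")
```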
“To mitigate the negative impacts of LLMs and other generative AI models, it is essential to prevent potential pitfalls in three key areas: regulation, data privacy, and bias,” wrote lead author Paul H. Yi, MD, of the Department of Radiology at St. Jude Children’s Research Hospital in Memphis, Tennessee, and colleagues.
Full disclosures can be found in the published review.
Source: Radiology