Synthetic retinal images generated by denoising diffusion probabilistic models can improve the performance of deep learning classifiers in identifying retinal diseases, according to the results of a recent study.
Led by Prashant U. Pandey of the School of Biomedical Engineering at The University of British Columbia in Canada, researchers had two primary aims:
- Assess whether denoising diffusion probabilistic models (DDPMs) could generate realistic retinal images representing diabetic retinopathy (DR), age-related macular degeneration (AMD), glaucoma, and normal eyes.
- Evaluate whether augmenting training data sets with these synthetic images improved the performance of a deep convolutional neural network (CNN) ensemble in classifying retinal disease.
Investigators used 144,513 retinal fundus images from 13 public data sets to train a latent diffusion model (Medfusion DDPM) and a CNN ensemble based on the Inception-V3 architecture. The CNN was benchmarked against 7 board-certified ophthalmologists (BCOs) using a test set of 100 real images, according to research published in the British Journal of Ophthalmology.
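For readers less familiar with the architecture, the sketch below shows how one member of an Inception-V3-based classifier ensemble might be set up and combined at inference time using PyTorch/torchvision. The class list, the ensemble-averaging scheme, and the model surgery shown here are illustrative assumptions, not the study's published training pipeline.

```python
# Illustrative sketch only: one member of an Inception-V3 fundus classifier
# ensemble. Hyperparameters, data pipeline, and averaging scheme are assumptions.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # DR, AMD, glaucoma, normal (study's four target classes)

def build_member() -> nn.Module:
    # Start from ImageNet weights and replace the final classification layers.
    model = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
    # Inception-V3 also has an auxiliary classifier used during training.
    model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, NUM_CLASSES)
    return model

def ensemble_predict(members, images: torch.Tensor) -> torch.Tensor:
    # Average softmax probabilities across members (one common ensembling scheme);
    # images are expected at Inception-V3's 299x299 input resolution.
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(m.eval()(images), dim=1) for m in members]
        )
    return probs.mean(dim=0).argmax(dim=1)
```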
They found that DDPM-generated images had a Fréchet Inception Distance (FID) of 27.03 and an inception score of 1.67; both metrics gauge the quality of generated images, with lower FID values and higher inception scores indicating more realistic output.
Of the mean FID score, the researchers wrote, “we found that our method generated images … of slightly higher quality than that of conditional GAN-generated fluorescein angiography images (FID = 30.3) but achieved worse quality than that of OCT image generation using the ProGAN methodology (FID range = 12.02–16.78). Our study produced worse scores for fundus image generation than the initial Medfusion study (mean fundus FID = 11.63), which is potentially due to our methodology focusing on generating four disease classes instead of only two.”
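For context, FID compares the mean and covariance of Inception-network feature vectors extracted from real and generated images; lower values mean the two feature distributions are closer. A minimal sketch of that calculation, assuming the feature vectors have already been extracted, is shown below.

```python
# Minimal FID sketch: compares feature statistics of real vs generated images.
# Feature extraction (a pretrained Inception network) is assumed to be done already.
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """real_feats, gen_feats: arrays of shape (n_images, n_features)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts
    # that arise from numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```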
Eight BCOs assessed 96 randomized images (48 real, 48 generated). Mean accuracy in distinguishing real from generated images was 61.1% (range, 51% to 68.8%). The BCOs identified generated normal and glaucoma images more accurately than generated AMD and DR images.
The investigators hypothesized that this difference may reflect the fact that fewer abnormalities are expected in normal and glaucoma images, whereas AMD and DR images can show a wider range of abnormalities, so artifacts in generated AMD and DR images may have aroused less suspicion.
BCO disease classification accuracy was 57.6% overall, 47.1% for generated images, and 68% for real images (P < .01). They more accurately classified diseases in generated AMD and normal images (78.1% and 76.6%, respectively), compared with disease classification accuracy of 38.5% for generated glaucoma images and 37% for generated DR images.
The CNN ensemble outperformed the BCOs even without synthetic images: baseline CNN accuracy was 80% vs 72.7% for the BCOs. The highest-performing model, trained with 238 generated AMD images, achieved an F-score of 85.1% compared with 72.2% for the BCOs.
Augmenting CNN training with 238 generated AMD images, matching the number of real AMD samples, increased mean accuracy from 80% to 84.6% and mean F-score from 80.8% to 85.1%. Adding more or fewer synthetic AMD images did not yield statistically significant gains, and including generated images for all disease classes did not produce further improvement.
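A rough sketch of this kind of class-targeted augmentation and of the reported metrics, using hypothetical PyTorch dataset objects and standard scikit-learn functions, might look like the following; it is illustrative only and not the authors' code.

```python
# Illustrative sketch: top up real training data with synthetic images of one
# class (AMD) until synthetic and real counts match, then evaluate with
# accuracy and a macro-averaged F-score. Dataset objects are placeholders.
from torch.utils.data import ConcatDataset, DataLoader, Subset
from sklearn.metrics import accuracy_score, f1_score

def augment_with_synthetic_amd(real_train_ds, synthetic_amd_ds, n_real_amd: int):
    # Add only as many synthetic AMD images as there are real AMD samples
    # (238 in the study's best-performing configuration).
    n_to_add = min(n_real_amd, len(synthetic_amd_ds))
    augmented = ConcatDataset([real_train_ds, Subset(synthetic_amd_ds, range(n_to_add))])
    return DataLoader(augmented, batch_size=32, shuffle=True)

def report(y_true, y_pred):
    # Macro-averaged F1 is one way to report an overall F-score across four classes.
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="macro")
```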
The researchers acknowledged limitations, including the small test set of 100 images, chosen to limit human rater fatigue; the lack of standardized grading across the public data sets, which introduced potential label noise; and the absence of clinical validation of the synthetic images, which may have affected interpretability. In addition, the limited number of AMD images required overlapping data sets for training and testing.
“These findings highlight the need for careful consideration of the augmentation strategy, particularly in relation to the disease category and the quality of generated images,” the researchers concluded. “Overall, we believe that this study demonstrates the potential of generative models to overcome data scarcity and improve deep learning-based medical diagnostics.”
No competing interests were declared.