Artificial intelligence (AI) models analyzing imaging, clinical, and molecular data may improve the diagnosis and prediction of polycystic ovary syndrome, according to a systematic review of 80 studies that found some tools consistently outperform traditional diagnostic approaches and identified potential molecular biomarkers.
The review, conducted by Mustafa Ghaderzadeh, PhD, of Urmia University of Medical Sciences, Iran, and colleagues and published in BMC Medical Informatics and Decision Making, examined applications of AI in diagnostics, biomarker discovery, risk prediction, and clinical decision support for patients with polycystic ovary syndrome (PCOS), as well as model interpretability and the emerging use of generative AI.
As the researchers explained, PCOS affects an estimated 8% to 13% of women of reproductive age worldwide but is often underdiagnosed or misdiagnosed because of variability in phenotypic presentation and inconsistent use of international diagnostic criteria. Standard diagnosis combines clinical evaluation, biochemical markers, and ultrasound imaging. These approaches, however, present their own issues, as ultrasound interpretation is operator-dependent, laboratory thresholds vary across populations, and phenotypic diversity complicates reproducibility.
“AI methods … have recently emerged as promising approaches to address these gaps,” they wrote. “This systematic review showed that … machine learning [ML], deep learning [DL], explainable AI (XAI), and large language models (LLMs) have significant potential to improve the diagnosis, prognosis, and management of PCOS.”
Review Synthesized 80 AI Studies
The researchers searched PubMed, Scopus, and Web of Science from database inception through March 2025. Of 662 records identified, 80 studies met the inclusion criteria after screening and full-text review.
Study designs included retrospective cohorts, cross-sectional studies, case-control analyses, and randomized controlled trials conducted across multiple countries. Sample sizes ranged from fewer than 50 participants to more than 30,000 records.
Given the heterogeneity of designs and outcomes, the studies were grouped into three main domains: imaging, clinical and electronic health record (EHR) data, and biomarkers or omics.
- Imaging: Convolutional neural networks, the dominant models in imaging studies, were applied to ultrasound and MRI-derived data, with reported diagnostic accuracies that often exceeded 95% and, in some studies, approached 98% to 99%.
- Clinical and EHR: Supervised ML algorithms, including random forests and support vector machines, were applied to structured features such as demographics, hormone levels, and metabolic indicators, and were reported to achieve consistently high performance in clinical and biochemical data sets.
- Biomarker and omics: ML analyses identified candidate biomarkers including HDDC3, SDC2, MAP1LC3A, and OVGP1, “with potential for risk stratification.”
Median reported accuracy for supervised ML models ranged from 78% to 95%, the researchers wrote, with DL approaches frequently demonstrating higher performance.
Explainability and Validation
About one-quarter of the included studies incorporated XAI techniques, including SHAP, LIME, and Grad-CAM. The researchers stated, “Where used, these improved interpretability and clinician confidence, but most models remained opaque.”
Risk-of-bias assessments using QUADAS-2 and ROBIS identified recurring issues with patient selection, data set representativeness, and lack of external validation. Many studies also relied heavily on Kaggle data sets, which the researchers said may limit generalizability.
The researchers also noted that across studies, key limitations included small sample sizes, class imbalance, and methodologic heterogeneity.
“While the performance metrics are encouraging, the evidence base remains fragile, prone to bias, and at risk of overfitting,” they wrote.
Emerging Role of LLMs
The review found that LLMs such as ChatGPT, BERT, and Gemini are being explored for patient communication, clinical note summarization, and literature synthesis.
However, the researchers noted that concerns remain about the accuracy and reliability of these technologies. Reported risks include biased outputs, hallucinated information, and lack of domain-specific medical training.
“For the safe and effective use of LLMs in health systems, there is a need to develop mechanisms for transparency, repeated validation, and the integration of organized and reliable medical data,” they stated.
Path to Clinical Integration
“AI has the potential to enhance early diagnosis and personalized management of PCOS, but adoption will depend on reproducibility, transparency, and clinician trust rather than accuracy alone,” the researchers wrote. They added that clinical adoption remains limited because of gaps in interpretability and inadequate validation.
To address current limitations in AI tools, the review identified priorities for future research. These included large multicenter studies, multimodal integration, routine incorporation of XAI, and careful evaluation of LLMs “to ensure safe, equitable, and clinically meaningful integration into PCOS care.”
The researchers reported no conflicts of interest.