Machine learning models demonstrated generally high performance across prosthetic joint infection–related diagnostic and prediction tasks following total hip or knee arthroplasty, according to a systematic review published in the Journal of Orthopaedic Research. However, most models lacked external validation and were developed using retrospective, single-center data, raising questions about real-world applicability.
Prosthetic joint infection (PJI) affects up to 1.7% of patients within 2 years of arthroplasty and is associated with substantial morbidity, reduced quality of life, prolonged hospitalization, and increased health care costs. Five-year mortality rates as high as 21% have been reported in patients with PJI after total hip arthroplasty.
“The diagnosis of PJI remains a challenge due to limitations in current diagnostic criteria,” the researchers wrote. “Machine learning offers a data-driven approach to improve diagnostic accuracy, potentially allowing earlier and more accurate identification to ensure appropriate treatment in a timely manner.”
How the Review Was Conducted
Researchers searched PubMed and Embase for studies applying machine learning to PJI-related clinical problems involving the hip or knee. While many studies focused on diagnosis, others addressed related tasks such as early infection prediction, recurrence, and surgical outcomes.
A total of 12 studies met inclusion criteria after screening 583 records. All used retrospective data sets, with sample sizes ranging from 20 to 17,165 surgeries. Only one study included external validation.
Model inputs varied widely and included patient demographics (11 studies), comorbidities (10), serologic markers (7), synovial fluid analysis (4), microbiology (3), and imaging (3). In total, 23 different machine learning approaches were evaluated, including linear models, tree-based methods, support vector machines, k-nearest neighbors, naive Bayes, and deep learning models.
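None of the reviewed papers publishes code, but a minimal sketch of how such model families are typically fit to tabular clinical data might look like the following. It uses scikit-learn with a synthetic stand-in for patient records; the feature matrix, class prevalence, and model settings are illustrative assumptions, not data or configurations from any included study.

```python
# Illustrative sketch only: synthetic stand-in for tabular PJI inputs
# (demographics, comorbidities, serologic markers); not study data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumed ~10% infection prevalence, 12 hypothetical features.
X, y = make_classification(n_samples=1000, n_features=12,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Three of the model families named in the review.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm": SVC(probability=True, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))
```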
Diagnostic Performance
Model performance was most commonly assessed using the area under the receiver operating characteristic curve (AUC). Reported AUC values ranged from 0.68 to 0.993, spanning acceptable to outstanding performance.
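As a point of reference, AUC is computed from a model's predicted probabilities rather than hard class labels. A minimal illustration with placeholder values (not taken from the review) follows.

```python
# Sketch of AUC computation for a binary PJI classifier;
# y_true and y_score are placeholder values, not study data.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 0, 1, 0, 1, 1]  # 1 = PJI present
y_score = [0.1, 0.3, 0.2, 0.8, 0.7, 0.65, 0.9, 0.2, 0.6, 0.75]
print(roc_auc_score(y_true, y_score))  # prints 0.96
```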
Examples of high-performing approaches included:
- Decision tree models for preoperative diagnosis, with AUC up to 0.993
- Meta-learner models for revision arthroplasty evaluation, with AUC up to 0.988
- Intraoperative prediction models during second-stage revision, with AUC up to 0.968
- Imaging-based models, with AUCs of 0.957 for knee and 0.906 for hip
In one study of intraoperative diagnosis, a model achieved 100% specificity and higher sensitivity than traditional criteria.
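For context, sensitivity and specificity fall out of a standard confusion matrix. The sketch below uses small illustrative counts, not the study's results, to show the computation.

```python
# Hedged sketch: sensitivity and specificity from a confusion matrix,
# with illustrative labels rather than the study's actual results.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate -> 0.75
specificity = tn / (tn + fp)  # true negative rate -> 1.0
print(sensitivity, specificity)
```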
However, the authors noted that in some cases, models were trained and evaluated using the same consensus diagnostic criteria (eg, MSIS or ICM), which may overestimate real-world performance.
Key Limitations and Next Steps
Despite encouraging results, the overall quality of studies was moderate, and several limitations were consistent across the literature. Most models were developed using retrospective, single-institution data with relatively short follow-up periods. External validation was rare, and model interpretability was often limited.
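To make the external-validation gap concrete, a hedged sketch follows: a model is fit on one simulated "center" and evaluated unchanged on a second cohort with a small covariate shift. Both cohorts and the shift are assumptions for illustration, not data from the reviewed studies.

```python
# Minimal sketch of external validation on synthetic cohorts.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1200, n_features=12, random_state=0)
# Treat a held-out slice as the "external center" and add a small
# covariate shift to mimic between-hospital differences.
X_int, X_ext, y_int, y_ext = train_test_split(X, y, test_size=0.33,
                                              random_state=0)
rng = np.random.default_rng(0)
X_ext = X_ext + rng.normal(0, 0.3, size=X_ext.shape)

model = LogisticRegression(max_iter=1000).fit(X_int, y_int)
print("external AUC:",
      roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))
```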
The authors also highlighted potential methodological concerns, including variability in input features and outcome definitions, as well as the risk of circularity when models are trained and tested against the same diagnostic frameworks.
“Machine learning models demonstrate significant potential,” the researchers wrote, “but further work is needed to ensure robustness and clinical applicability.”
They emphasized the importance of multicenter studies using standardized, diverse data sets, along with rigorous external validation and more transparent modeling approaches.
For full disclosures of the researchers, visit onlinelibrary.wiley.com.
Source: Journal of Orthopaedic Research