An artificial intelligence–based tool was associated with improved mitotic score accuracy and reproducibility among junior pathologists evaluating breast carcinoma whole-slide images in a preliminary single-center reader study.
Researchers evaluated an artificial intelligence (AI) system designed to detect mitoses, identify mitotic hotspots, and support mitotic score assignment according to the Elston and Ellis grading system. The study included 50 breast carcinoma specimens from Bicêtre Hospital, including 25 biopsies and 25 surgical specimens collected between February 2020 and June 2021.
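For context, the Elston and Ellis system assigns a mitotic score of 1, 2, or 3 based on the number of mitoses counted over a fixed area, with cut-offs that depend on the microscope field diameter in use. The short Python sketch below illustrates that mapping; the threshold values are hypothetical placeholders for illustration, not the calibration used in the study.

```python
# Illustrative sketch of Elston-Ellis mitotic scoring: map a mitotic
# count (per 10 high-power fields) to a score of 1, 2, or 3. Cut-offs
# depend on the microscope field diameter; the defaults below are
# hypothetical placeholders, not the study's thresholds.

def mitotic_score(count_per_10_hpf: int,
                  score2_min: int = 8,
                  score3_min: int = 16) -> int:
    """Return the Elston-Ellis mitotic score (1-3) for a given count.

    score2_min and score3_min are placeholder cut-offs; real values
    must come from a published table for the field area in use.
    """
    if count_per_10_hpf >= score3_min:
        return 3
    if count_per_10_hpf >= score2_min:
        return 2
    return 1

print(mitotic_score(5))   # -> 1
print(mitotic_score(10))  # -> 2
print(mitotic_score(20))  # -> 3
```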
Three expert pathologists independently reviewed the whole-slide images and established consensus mitotic scores as the reference standard. Notably, the experts initially disagreed on mitotic score assignment in 17 of 50 cases before resolving discrepancies through consensus review, underscoring the known variability of mitotic scoring.
Two junior pathologists then reviewed the same slides twice, once with and once without AI assistance, with the two readings separated by a washout period of several weeks (the exact duration was not specified) intended to reduce recall bias. During AI-assisted review, the pathologists could view predicted mitotic hotspots and detected mitotic figures but retained autonomy in selecting counting regions while navigating the full whole-slide images.
Compared with expert consensus, mitotic score accuracy increased from 62% without AI assistance to 76% with AI assistance for 1 investigator and from 64% to 78% for the second investigator. Agreement between each investigator and the expert consensus also improved, with weighted Cohen’s kappa increasing from 0.378 to 0.629 for investigator 1 and from 0.457 to 0.726 for investigator 2.
Agreement between the 2 junior pathologists also improved with AI assistance. Weighted Cohen’s kappa increased from 0.482 to 0.672, while intraclass correlation coefficient (ICC) values for mitotic counts increased from 0.591 to 0.883. By comparison, Figure 3c in the study reported an inter-expert ICC of 0.804, providing a benchmark for expert-level reproducibility.
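For readers less familiar with these agreement statistics, the sketch below shows how a weighted Cohen's kappa (for ordinal scores) and a two-rater ICC (for raw mitotic counts) are typically computed in Python. The ratings are invented toy data, and the linear weighting scheme is an assumption; the study's exact weighting is not described here.

```python
# Toy illustration of the agreement statistics reported in the study:
# weighted Cohen's kappa for ordinal scores and an intraclass
# correlation coefficient (ICC) for raw counts. All data are invented.
import pandas as pd
from sklearn.metrics import cohen_kappa_score
import pingouin as pg

# Hypothetical mitotic scores (1-3) from two readers on ten slides.
reader1 = [1, 1, 2, 3, 2, 1, 3, 2, 1, 3]
reader2 = [1, 2, 2, 3, 1, 1, 3, 3, 1, 2]
# Linear weights are an assumption; quadratic weighting is also common.
kappa = cohen_kappa_score(reader1, reader2, weights="linear")
print(f"weighted kappa: {kappa:.3f}")

# Hypothetical raw mitotic counts for the same slides, in the long
# format that pingouin's ICC routine expects.
counts = pd.DataFrame({
    "slide": list(range(10)) * 2,
    "rater": ["r1"] * 10 + ["r2"] * 10,
    "count": [4, 6, 11, 22, 9, 3, 18, 12, 5, 20,
              5, 8, 10, 25, 7, 4, 17, 14, 5, 16],
})
icc = pg.intraclass_corr(data=counts, targets="slide",
                         raters="rater", ratings="count")
print(icc[["Type", "ICC"]])
```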
Agreement in selected counting zones also improved, with overlapping regions selected in 44% of slides without AI assistance and 60% with AI assistance.
The largest gains were observed in diagnostically challenging subgroups. Among slides with mitotic score 2 (n = 9), accuracy increased from 6% to 33%, although the subgroup was very small and the findings should be considered exploratory. Among slides with mitotic score 3 (n = 12), accuracy increased from 29% to 67%. No meaningful change was observed among slides with mitotic score 1, where baseline agreement was already high.
Accuracy also improved across specimen types, increasing from 60% to 72% in biopsies and from 66% to 82% in surgical specimens.
The researchers suggested that the observed benefit was driven largely by more consistent hotspot selection. The proportion of slides in which an AI-suggested hotspot intersected the selected counting zone increased from 46% to 80% for 1 investigator and from 62% to 90% for the other investigator.
Although the algorithm demonstrated modest analytical detection performance in prior validation—showing recall of 44% and precision of 28% after classification refinement on the private validation data set—the authors argued that traditional precision and recall metrics may be imperfect benchmarks for mitosis detection because expert annotations themselves are highly variable. They noted that agreement between the algorithm and expert mitotic scoring (ICC = 0.644) approached inter-expert reproducibility levels (ICC = 0.716) reported in earlier validation work.
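To make the reported detection metrics concrete, the sketch below computes precision and recall for point-wise mitosis predictions against expert annotations, where precision = TP/(TP + FP) and recall = TP/(TP + FN). The greedy nearest-neighbor matching rule and the tolerance radius are illustrative assumptions, not the study's actual evaluation protocol.

```python
# Sketch of detection precision/recall for point-wise mitosis
# predictions. A prediction counts as a true positive if it falls
# within `tol` of an unmatched ground-truth point; the greedy matching
# and the tolerance value are assumptions for illustration.
import math

def match_detections(preds, truths, tol=30.0):
    """Greedily match predictions to ground truth within `tol`
    (e.g., pixels or microns); return (TP, FP, FN)."""
    unmatched = list(truths)
    tp = 0
    for p in preds:
        best, best_d = None, tol
        for t in unmatched:
            d = math.dist(p, t)
            if d <= best_d:
                best, best_d = t, d
        if best is not None:
            unmatched.remove(best)
            tp += 1
    fp = len(preds) - tp
    fn = len(unmatched)
    return tp, fp, fn

# Invented coordinates for demonstration.
preds = [(10, 10), (100, 105), (300, 290), (500, 500)]
truths = [(12, 9), (98, 100), (310, 300), (700, 700), (800, 820)]
tp, fp, fn = match_detections(preds, truths)
print(f"precision={tp / (tp + fp):.2f}, recall={tp / (tp + fn):.2f}")
```

As the authors note, these metrics can understate useful performance when the ground-truth annotations themselves vary between experts.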
The AI tool and visualization software evaluated in the study were developed by Primaa, and several study authors, including lead study author Clara Simmat of Primaa in Paris, France, were company employees.
“AI assistance notably improved [mitotic score] accuracy, especially in diagnostically complex cases,” the researchers wrote, adding that the system improved consistency in hotspot selection and scoring “without limiting the pathologists’ final judgment.”
The findings should be interpreted cautiously because the study involved only 2 junior pathologist readers at a single institution and evaluated retrospectively collected specimens in a structured reader-study setting rather than during active clinical workflow or against patient outcomes.
Disclosures: Authors C.S., S.S., N.P., R.P., and M.S. were employees of Primaa. The remaining researchers reported no commercial or financial relationships that could be construed as potential conflicts of interest.
Source: Diagnostics