Researchers explored the potential of large language models to assist neuropathologists in diagnosing complex brain tumors.
In a study published in The Journal of Pathology: Clinical Research, researchers tested the diagnostic accuracy of three leading large language models (LLMs)—ChatGPT-4o, Claude-3.5-sonnet, and Llama3—on a set of 30 challenging neuropathology cases. These cases incorporated intricate morphological and genetic data to reflect the complexity of real-world clinical practice.
The results showed that while LLMs struggled in their "zero-shot" (unassisted) form, their accuracy soared when paired with a Retrieval-Augmented Generation (RAG) approach. This framework integrates clinical guidelines, such as the latest World Health Organization (WHO) classification for central nervous system (CNS) tumors, into the model's decision-making process.
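In broad strokes, a RAG pipeline retrieves the guideline passages most relevant to a case and supplies them to the model alongside the diagnostic question. The sketch below is a minimal, hypothetical illustration of that pattern, not the study's actual implementation: the guideline excerpts are invented stand-ins for WHO CNS classification text, and the token-overlap retriever is a deliberately simple substitute for the embedding-based retrieval a production system would use.

```python
# Illustrative RAG sketch (hypothetical; not the study's pipeline).
# Guideline snippets and the overlap-based retriever are stand-ins.

from collections import Counter

# Hypothetical excerpts standing in for WHO CNS tumor classification text.
GUIDELINE_SNIPPETS = [
    "IDH-mutant diffuse astrocytoma with CDKN2A/B homozygous deletion is CNS WHO grade 4.",
    "Oligodendroglioma requires IDH mutation and 1p/19q codeletion.",
    "Glioblastoma, IDH-wildtype, may be diagnosed with TERT promoter mutation or EGFR amplification.",
]

def tokenize(text: str) -> list[str]:
    """Lowercase and strip trailing punctuation from each word."""
    return [t.strip(".,;:").lower() for t in text.split()]

def retrieve(query: str, snippets: list[str], k: int = 2) -> list[str]:
    """Rank snippets by token overlap with the query; return the top k."""
    q_tokens = Counter(tokenize(query))
    scored = [
        (sum(q_tokens[t] for t in set(tokenize(s))), s)
        for s in snippets
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored[:k] if score > 0]

def build_prompt(case_description: str) -> str:
    """Prepend the retrieved guideline context to the diagnostic question."""
    context = "\n".join(retrieve(case_description, GUIDELINE_SNIPPETS))
    return (
        "Use the following WHO CNS tumor classification excerpts:\n"
        f"{context}\n\n"
        f"Case: {case_description}\n"
        "What is the integrated diagnosis?"
    )

case = "Diffuse glioma, IDH mutation present, 1p/19q codeletion detected."
print(build_prompt(case))
```

The assembled prompt would then be sent to the LLM, which grounds its answer in the retrieved excerpts rather than relying solely on its (possibly outdated) training data.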
Neuropathology is uniquely complex, blending morphological, genetic, and epigenetic criteria for diagnosis—a challenge compounded by frequent updates to the WHO guidelines. Even seasoned specialists can find it difficult to stay current. The researchers hypothesized that LLMs could be a valuable support for neuropathologists during the diagnostic process.
When tested in zero-shot mode, the models achieved an overall accuracy of just 22.2%. Among the models, Claude-3.5-sonnet showed the best performance with 53.3% accuracy. Common errors included outdated terminology and misclassification of tumor grades.
However, when equipped with RAG, the accuracy of all three models rose to roughly 90%. ChatGPT-4o emerged as the leader, correctly diagnosing 93.3% of cases, followed closely by Claude-3.5-sonnet at 90% and Llama3 at 86.7%. The RAG-enabled models excelled at distinguishing tumor subtypes, including astrocytomas, glioblastomas, and oligodendrogliomas, with only minor discrepancies.
One critical error noted in the study was ChatGPT-4o's misclassification of an early glioblastoma as a low-grade astrocytoma—an error with significant clinical implications for patient management and outcomes. However, the researchers noted that this specific case presented a "diagnostic challenge."
The study focused exclusively on adult-type diffuse gliomas, leaving a vast array of neuropathological entities unexamined. Additionally, the challenge of safeguarding patient data in AI workflows remains a significant barrier to clinical adoption, the authors added.
Future research will need to test these models across a broader range of diagnoses, including rare and ambiguous cases.
Conflict of interest disclosures can be found in the study.