Objective:
To evaluate the performance of large language models (LLMs) in differential diagnosis and other clinical reasoning tasks using a new composite metric, the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM), which aims to provide a more comprehensive assessment than traditional benchmarks.
Key Findings:
- PrIME-LLM scores ranged from 0.64 to 0.78, with reasoning-optimized models outperforming nonreasoning models.
- Differential diagnosis tasks had failure rates exceeding 80%, whereas final diagnosis tasks had failure rates below 40%.
- Diagnostic testing performance fell between that of differential diagnosis and final diagnosis.
- Final diagnosis was more accurate than both diagnostic testing and differential diagnosis across nearly all models.
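The per-task failure rates above can be illustrated with a minimal sketch. It assumes each model response has already been graded as pass or fail for a named task; the function name, task labels, and toy grades are hypothetical and are not taken from the study. It does not compute PrIME-LLM itself, whose formula is not given in this summary.

```python
from collections import defaultdict


def failure_rates(graded):
    """Return {task: fraction of failed responses}.

    `graded` is an iterable of (task, passed) pairs, where `passed`
    is True if the response was judged acceptable. This grading
    format is an assumption made for illustration.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for task, passed in graded:
        totals[task] += 1
        if not passed:
            failures[task] += 1
    return {task: failures[task] / totals[task] for task in totals}


# Made-up toy grades, NOT data from the study.
toy_grades = [
    ("differential_diagnosis", False),
    ("differential_diagnosis", False),
    ("differential_diagnosis", True),
    ("differential_diagnosis", False),
    ("final_diagnosis", True),
    ("final_diagnosis", True),
    ("final_diagnosis", False),
    ("final_diagnosis", True),
]

rates = failure_rates(toy_grades)
for task, rate in rates.items():
    print(f"{task}: {rate:.0%} failed")
```

Reporting a rate per task, rather than one pooled accuracy figure, is what makes the gap between differential diagnosis and final diagnosis visible.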
Interpretation:
LLMs struggle to maintain and refine differential diagnoses and often converge prematurely on a single answer. This points to significant limitations in handling clinical uncertainty and a need for stronger reasoning capabilities.
Limitations:
- Use of publicly available clinical vignettes that may have been included in model training data, potentially biasing results.
- Exclusion of augmentation tools such as retrieval systems, which could enhance performance, limiting the applicability of the findings.
- Variability in model responses and inherent limitations of current architectures, which may affect the reliability of the results.
Conclusion:
Despite improvements, LLMs have not achieved the intelligence required for safe clinical deployment. They should be used as supervised adjuncts, with physicians remaining the primary decision-makers, and they need ongoing evaluation and refinement.
This content is an AI-generated, fully rewritten summary based on a published scholarly article. It does not reproduce the original text and is not a substitute for the original publication. Readers are encouraged to consult the source for full context, data, and methodology.