Objective:
To evaluate the accuracy of artificial intelligence systems in detecting laryngeal disorders.
Approach:
Review of published studies evaluating AI models for detecting and classifying laryngeal disorders, most of which analyzed voice recordings.
Key Findings:
- AI models performed best on binary classification, with accuracies of 88% to 99% for distinguishing healthy from pathologic voices.
- Accuracy declined to roughly 70% to 90% when models assigned voices to broader pathophysiologic categories, and generally stayed below 75% when they identified specific disorders.
- Performance varied by model architecture and data type: traditional machine-learning models reached 88% to 96% accuracy on binary tasks, whereas deep-learning systems reached 97% to 99% on standardized datasets.
- Most studies relied on internal validation, and performance often fell by 10 to 20 percentage points when models were tested on independent cohorts.
Interpretation:
The drop in performance from detection to diagnosis is attributed to acoustic overlap among laryngeal disorders: distinct diseases can produce similar voice abnormalities, which makes them difficult to distinguish acoustically.
Limitations:
- Many studies had methodological weaknesses, including reliance on a small number of databases, class imbalance, and a lack of demographic diversity.
- About 82% of studies used sustained-vowel tasks, which may not capture clinically relevant vocal variability.
- Fewer than 15% of studies shared source code or complete model documentation, limiting reproducibility.
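Class imbalance, one of the methodological concerns listed above, can make accuracy figures look stronger than a model's real ability to separate classes. The Python sketch below is a hypothetical illustration, not a method taken from the reviewed studies: it assumes an invented set of 90 pathologic and 10 healthy recordings, scores a trivial classifier that labels every recording as pathologic, and compares plain accuracy with balanced accuracy (the mean of per-class recall).

```python
# Hypothetical illustration: why plain accuracy can mislead on imbalanced data.
# The label counts below are invented for demonstration, not taken from any study.


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, so each class counts equally regardless of its size."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(y_pred[i] == c for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Assumed dataset: 90 pathologic and 10 healthy recordings (a 9:1 imbalance).
y_true = ["pathologic"] * 90 + ["healthy"] * 10

# A trivial "model" that labels every recording as pathologic.
y_pred = ["pathologic"] * len(y_true)

print(f"accuracy:          {accuracy(y_true, y_pred):.2f}")           # accuracy:          0.90
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.2f}")  # balanced accuracy: 0.50
```

Here the trivial classifier scores 0.90 accuracy without ever recognizing a healthy voice, while balanced accuracy falls to 0.50, the chance level for two classes.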
Conclusion:
Current evidence supports AI primarily as a tool for screening, triage, and decision support rather than as an autonomous diagnostic system.
Sources:
This content is an AI-generated, fully rewritten summary based on a published scholarly article. It does not reproduce the original text and is not a substitute for the original publication. Readers are encouraged to consult the source for full context, data, and methodology.