A novel machine learning model may be effective at screening for hypertension by analyzing acoustic features of speech, according to a new study.
In the study, published in IEEE Access, researchers from the University of Toronto and Klick Inc. developed predictive models using spectrotemporal acoustic features extracted from the speech recordings of 245 participants (91 women). They explored multiple approaches, including analyzing whole recordings, dividing recordings into quarters, and stacking features from all quarters.
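The quarter-and-stack idea is straightforward to illustrate. Below is a minimal sketch, not the authors' implementation; the helper names and the pluggable `extract` callable are hypothetical:

```python
import numpy as np

def split_into_quarters(signal: np.ndarray) -> list:
    """Split a 1-D audio signal into four equal-length segments
    (tail samples beyond a multiple of four are dropped)."""
    q = len(signal) // 4
    return [signal[i * q:(i + 1) * q] for i in range(4)]

def stacked_features(signal: np.ndarray, extract) -> np.ndarray:
    """Extract a feature vector from each quarter and concatenate them,
    so a classifier sees per-quarter information side by side."""
    return np.concatenate([extract(seg) for seg in split_into_quarters(signal)])
```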
Participant demographics:
- Mean age: 40.0 years for women, 39.0 years for men
- Mean body mass index: 24.7 kg/m² for women, 25.8 kg/m² for men
- Mean systolic blood pressure (SBP): 122.0 mmHg for women, 127.0 mmHg for men
- Mean diastolic blood pressure (DBP): 80.0 mmHg for women, 82.0 mmHg for men
Participants recorded the phrase: "Hello, how are you? What is my glucose level right now?" up to six times daily for two weeks. The researchers extracted 160 acoustic features from the recordings, including temporal, spectral, and spectrotemporal characteristics. Key features included Mel-frequency cepstral coefficients (MFCCs), fundamental frequency, and spectral contrast. Separate models were developed for men and women, and the leave-one-subject-out cross-validation method was used to assess generalizability. To address class imbalance, the synthetic minority oversampling technique (SMOTE) was applied.
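As a rough illustration of this kind of extraction (not the authors' exact 160-feature set), here is a minimal sketch assuming the open-source librosa library; the summary statistics and the pitch search range are assumptions:

```python
import numpy as np
import librosa  # assumed audio toolkit; the paper does not name its tooling

def summarize(frames: np.ndarray) -> np.ndarray:
    """Collapse a (n_features, n_frames) matrix to per-feature mean and std."""
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])

def acoustic_features(path: str) -> np.ndarray:
    """Extract a fixed-length feature vector from one speech recording."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # spectrotemporal
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # spectral
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)             # fundamental frequency
    return np.concatenate([summarize(mfcc), summarize(contrast),
                           np.array([f0.mean(), f0.std()])])
```

Per-recording vectors like these would then be computed per quarter and stacked as sketched earlier.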
The researchers evaluated two hypertension thresholds (see the sketch after this list):
- SBP ≥ 135 mmHg or DBP ≥ 85 mmHg
- SBP ≥ 140 mmHg or DBP ≥ 90 mmHg
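In practice, either threshold reduces to a simple OR rule over a participant's readings; a minimal sketch with hypothetical values:

```python
import numpy as np

sbp = np.array([118.0, 142.0, 131.0])  # hypothetical systolic readings, mmHg
dbp = np.array([76.0, 91.0, 86.0])     # hypothetical diastolic readings, mmHg

hypertensive_135_85 = (sbp >= 135) | (dbp >= 85)  # first threshold
hypertensive_140_90 = (sbp >= 140) | (dbp >= 90)  # second threshold
```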
For the first threshold, balanced accuracy was 84% for women and 77% for men. For the second threshold, it was 63% for women and 86% for men.
Stacking features from all quarters of each recording produced the best performance, with the second quarter containing the most discriminative information. The linear discriminant analysis (LDA) classifier performed best overall.
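Putting the pieces together, here is a minimal sketch of such an evaluation loop, assuming scikit-learn and imbalanced-learn; `X`, `y`, and `groups` (one ID per participant) are hypothetical inputs, and the paper's exact pipeline may differ. Balanced accuracy, the metric reported above, is the average of sensitivity and specificity, which keeps a majority-class predictor from looking artificially good:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

def loso_balanced_accuracy(X: np.ndarray, y: np.ndarray, groups: np.ndarray) -> float:
    """Leave-one-subject-out CV: hold out every recording from one participant,
    oversample the minority class in the training fold only (SMOTE),
    fit LDA, and predict the held-out subject's recordings."""
    preds = np.empty_like(y)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        X_res, y_res = SMOTE(random_state=0).fit_resample(X[train_idx], y[train_idx])
        clf = LinearDiscriminantAnalysis().fit(X_res, y_res)
        preds[test_idx] = clf.predict(X[test_idx])
    return balanced_accuracy_score(y, preds)
```

Applying SMOTE inside each training fold, rather than before splitting, avoids leaking synthetic copies of a held-out subject's data into training.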
The researchers noted several limitations of the study, including a small number of hypertensive cases and a predominantly Indian participant pool, which limit the generalizability of the findings. The model also required multiple recordings per participant to achieve optimal performance; using a single recording performed substantially worse than using all available recordings. Future investigations will explore ways to reduce the number of required recordings and apply deep learning approaches to further improve the model's accuracy.