A multimodal foundation model trained on overnight polysomnography recordings can accurately predict the future onset of 130 distinct medical conditions from a single night of sleep data, according to research published in Nature Medicine.
The model, called SleepFM, achieved a concordance index (C-Index) of at least 0.75 for all 130 conditions, including all-cause mortality (0.84), dementia (0.85), myocardial infarction (0.81), heart failure (0.80), chronic kidney disease (0.79), stroke (0.78), and atrial fibrillation (0.78).
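For readers unfamiliar with the metric, the C-Index measures how often a model assigns the higher risk score to the patient who develops the condition sooner. The following is a minimal sketch of Harrell's concordance index on toy data. The function name and the example values are illustrative assumptions, and the article does not state which estimator the study used.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable patient pairs ranked correctly by risk.

    times  -- follow-up time per patient (to event or to censoring)
    events -- 1 if the condition was observed, 0 if the patient was censored
    risks  -- model risk score per patient (higher means earlier expected onset)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (made-up numbers, not study data).
toy_times = [2.0, 5.0, 3.0, 8.0]
toy_events = [1, 0, 1, 0]
toy_risks = [0.9, 0.2, 0.4, 0.5]
toy_c_index = concordance_index(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the figures above should be read.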
Researchers from Stanford University, the Technical University of Denmark, Rigshospitalet, and other institutions developed SleepFM using a novel leave-one-out contrastive learning approach that accommodates multiple polysomnography (PSG) configurations. The model was pretrained on more than 585,000 hours of PSG recordings from approximately 65,000 patients across four primary cohorts: Stanford Sleep Clinic (35,052 studies), BioSerenity (18,900 studies), the Multi-Ethnic Study of Atherosclerosis (2,237 PSGs), and the Outcomes of Sleep Disorders in Older Men study (3,930 PSGs).
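The article does not spell out the leave-one-out contrastive objective, so the sketch below shows one plausible reading: each modality's embedding is pulled toward the average embedding of the remaining modalities for the same recording segment and pushed away from the averages of other segments. The function names, the temperature value, and the use of an InfoNCE-style loss are assumptions for illustration, not the authors' implementation.

```python
import math


def _normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leave_one_out_contrastive_loss(embeddings, temperature=0.1):
    """embeddings[m][i] is the embedding of modality m for segment i."""
    n_modalities = len(embeddings)
    n_samples = len(embeddings[0])
    total = 0.0
    for m in range(n_modalities):
        others = [k for k in range(n_modalities) if k != m]
        # Target for each segment: mean embedding of all other modalities.
        targets = []
        for i in range(n_samples):
            dim = len(embeddings[m][i])
            mean = [
                sum(embeddings[k][i][d] for k in others) / len(others)
                for d in range(dim)
            ]
            targets.append(_normalize(mean))
        anchors = [_normalize(v) for v in embeddings[m]]
        for i in range(n_samples):
            logits = [_dot(anchors[i], t) / temperature for t in targets]
            max_logit = max(logits)
            log_denominator = max_logit + math.log(
                sum(math.exp(value - max_logit) for value in logits)
            )
            # Cross-entropy with the matching segment as the positive.
            total += log_denominator - logits[i]
    return total / (n_modalities * n_samples)


# Toy check: three modalities, two segments, 2-dimensional embeddings.
aligned = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(3)]
shuffled = [[[0.0, 1.0], [1.0, 0.0]]] + [[[1.0, 0.0], [0.0, 1.0]] for _ in range(2)]
aligned_loss = leave_one_out_contrastive_loss(aligned)
shuffled_loss = leave_one_out_contrastive_loss(shuffled)
```

Because the target is an average over whichever modalities are present, an objective of this shape does not require every cohort to record the same set of signals, which is consistent with the goal of accommodating multiple PSG configurations.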
Architecture and Training Approach
SleepFM integrates four signal modalities: brain activity signals (including electroencephalogram and electrooculogram), electrocardiography, electromyography, and respiratory signals. The architecture employs one-dimensional convolutional layers for feature extraction, followed by channel-agnostic attention pooling to address variability in channel number and order across cohorts. A transformer block captures temporal dependencies over a 5-minute context window.
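The idea behind channel-agnostic pooling can be illustrated with a small sketch: per-channel feature vectors are scored against a query vector, the scores are turned into softmax weights, and the weighted sum yields one fixed-size vector regardless of how many channels were recorded or in what order. The query vector stands in for a learned parameter and the function name is hypothetical; the actual pooling layer in SleepFM may differ.

```python
import math


def attention_pool(channel_features, query):
    """Collapse a variable number of channel vectors into one vector."""
    # Score each channel by its dot product with the (assumed learned) query.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in channel_features]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    return [
        sum(w * feat[d] for w, feat in zip(weights, channel_features))
        for d in range(dim)
    ]


# Toy example: three channels in one cohort, the same channels reordered,
# and a two-channel montage from a hypothetical second cohort.
example_query = [1.0, 0.0]
three_channels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pooled = attention_pool(three_channels, example_query)
pooled_reordered = attention_pool(list(reversed(three_channels)), example_query)
pooled_two = attention_pool(three_channels[:2], example_query)
```

The output has the same length in all three cases, and reordering the channels leaves it unchanged up to floating-point rounding.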
The preprocessing pipeline resamples all signals to 128 Hz and segments them into 5-second windows serving as the model's fundamental input tokens. For patient-level prediction tasks, an additional temporal pooling layer compresses all token embeddings into a single 128-dimensional embedding.
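The figures above imply some simple arithmetic: at 128 Hz, a 5-second token holds 640 samples, and a 5-minute context window spans 60 tokens. The sketch below works through that arithmetic and shows a basic segmentation step. The constant and function names are illustrative, and dropping a trailing partial window is an assumption not stated in the article.

```python
SAMPLE_RATE_HZ = 128
TOKEN_SECONDS = 5
CONTEXT_SECONDS = 5 * 60

SAMPLES_PER_TOKEN = SAMPLE_RATE_HZ * TOKEN_SECONDS      # 640 samples
TOKENS_PER_CONTEXT = CONTEXT_SECONDS // TOKEN_SECONDS   # 60 tokens


def segment(signal):
    """Split a 1-D list of samples into non-overlapping 5-second tokens.

    Assumption: a trailing partial window is discarded.
    """
    n_tokens = len(signal) // SAMPLES_PER_TOKEN
    return [
        signal[i * SAMPLES_PER_TOKEN:(i + 1) * SAMPLES_PER_TOKEN]
        for i in range(n_tokens)
    ]


# One minute of a synthetic single-channel signal plus 100 leftover samples.
one_minute = [0.0] * (SAMPLE_RATE_HZ * 60 + 100)
tokens = segment(one_minute)
```

By the same arithmetic, an 8-hour recording would produce 5,760 tokens per channel before temporal pooling.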
For disease prediction, researchers paired Stanford Sleep Clinic data with electronic health records, extracting all diagnostic codes (International Classification of Diseases, Ninth Revision and Tenth Revision) and mapping them to phecodes—a hierarchical system of 1,868 disease categories. Positive cases were defined as patients whose first phecode instance occurred more than 7 days following the sleep study. The team excluded phecodes with prevalence below 1.5%, resulting in 1,041 phecodes for evaluation.
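The labeling rule and prevalence filter described above can be sketched as follows. The function names, data layout, and phecode identifiers in the example are hypothetical. The article does not say whether patients whose first code fell before or within 7 days of the study were excluded or counted as negatives, so the sketch only answers whether a patient qualifies as a positive case.

```python
from datetime import date, timedelta

MIN_GAP = timedelta(days=7)   # first code must come more than 7 days later
MIN_PREVALENCE = 0.015        # phecodes below 1.5% prevalence are dropped


def is_positive(study_date, first_code_date):
    """True if the first phecode instance is more than 7 days after the study."""
    if first_code_date is None:   # patient never received this phecode
        return False
    return first_code_date - study_date > MIN_GAP


def keep_phecodes(labels_by_phecode):
    """Return phecodes whose share of positive patients is at least 1.5%."""
    kept = []
    for code, labels in labels_by_phecode.items():
        if labels and sum(labels) / len(labels) >= MIN_PREVALENCE:
            kept.append(code)
    return kept


# Toy example with made-up phecode identifiers and 100 patients each.
example_labels = {
    "code_a": [True] * 2 + [False] * 98,   # 2% prevalence, kept
    "code_b": [True] * 1 + [False] * 99,   # 1% prevalence, dropped
}
kept_codes = keep_phecodes(example_labels)
```

Requiring a gap after the sleep study keeps conditions that were already being diagnosed around the time of the recording from counting as future-onset cases.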
Disease-Specific Performance
Among neurological conditions, the model demonstrated particularly strong results for Parkinson disease (area under the receiver operating characteristic curve [AUROC], 0.93), mild cognitive impairment (AUROC, 0.84), and developmental delays and disorders (AUROC, 0.84). For Alzheimer disease specifically, SleepFM achieved a C-Index of 0.91.
Circulatory conditions showed robust prediction, with hypertensive heart disease reaching an AUROC of 0.88 and intracranial hemorrhage at 0.82. In the neoplasm category, the model predicted prostate cancer (AUROC, 0.90), breast cancer (AUROC, 0.90), and melanomas of skin (AUROC, 0.83).
Analysis stratified by sleep stages and signal modalities revealed that brain activity signals better captured mental and neurological conditions, respiratory signals proved more useful for respiratory and metabolic disorders, and electrocardiography signals were more informative for circulatory diseases. However, combining all modalities yielded optimal overall performance.
Transfer Learning and Generalization
To assess cross-site generalization, researchers evaluated SleepFM on the Sleep Heart Health Study (SHHS)—a multicenter data set of 6,441 adults aged 40 years and older that was entirely excluded from pretraining. On this external validation set (n = 2,000), the model achieved statistically significant accuracy for stroke (C-Index, 0.82), congestive heart failure (C-Index, 0.85), and cardiovascular disease–related mortality (C-Index, 0.88).
Temporal generalization testing on Stanford patients from 2020 onward—with all model pretraining and training conducted on pre-2020 data—showed robust performance for death (C-Index, 0.83), heart failure (C-Index, 0.80), and dementia (C-Index, 0.83).
Comparison With Baseline Models
SleepFM consistently outperformed two supervised baselines: a demographics model (multilayer perceptron trained on age, sex, body mass index, and race/ethnicity) and an end-to-end PSG model trained directly on raw signals without pretraining.
The percentage difference in AUROC between SleepFM and each baseline model ranged from 5% to 17% across disease categories, with more pronounced gains in neurological and hematopoietic conditions. For all-cause mortality prediction, SleepFM achieved an AUROC of 0.85, compared with 0.78 for both the demographics baseline and the end-to-end PSG model.
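As a worked example, the all-cause mortality figures can be converted to a percentage difference. The article does not define the formula, so the sketch assumes a relative difference with the baseline AUROC as the denominator; the function name is illustrative.

```python
def relative_gain_percent(model_score, baseline_score):
    """Relative improvement over the baseline, in percent (assumed definition)."""
    return (model_score - baseline_score) / baseline_score * 100.0


# All-cause mortality: AUROC 0.85 for SleepFM vs 0.78 for either baseline.
mortality_gain = relative_gain_percent(0.85, 0.78)
```

Under that assumption the gain for all-cause mortality is roughly 9%, which falls inside the reported 5% to 17% range.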
Notably, SleepFM trained on just 10% of the data outperformed the demographics baseline trained on five times more data across all conditions in the Stanford cohort and four of six conditions in the SHHS cohort.
For specific conditions, SleepFM demonstrated substantial improvements over demographics-only prediction: senile dementia (C-Index, 0.99 vs 0.87), myoneural disorders (0.81 vs 0.42), developmental delays (0.80 vs 0.58), and atherosclerosis (0.92 vs 0.74).
Standard Sleep Analysis Tasks
Beyond disease prediction, SleepFM performed competitively on conventional sleep analysis benchmarks. For sleep staging, the model achieved mean F1 scores of 0.70 to 0.78 across cohorts—comparable to specialized models including U-Sleep, YASA, GSSC, and STAGES. Sleep apnea classification accuracy reached 0.69 for four-category severity classification (based on apnea-hypopnea index thresholds of 5, 15, and 30) and 0.87 for binary presence classification (none/mild vs moderate/severe).
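The two sleep apnea targets can be derived from the apnea-hypopnea index (AHI) with the thresholds named above. The sketch below uses the common clinical convention in which each threshold marks the lower bound of the next category; the article does not state how boundary values were handled, so that choice and the function names are assumptions.

```python
def ahi_severity(ahi):
    """Map an apnea-hypopnea index (events per hour) to a four-level category."""
    if ahi < 5:
        return "none"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


def has_moderate_or_severe_apnea(ahi):
    """Binary target: none/mild (False) vs moderate/severe (True)."""
    return ahi_severity(ahi) in ("moderate", "severe")


example_categories = [ahi_severity(value) for value in (2.0, 5.0, 22.5, 30.0)]
```

The binary task merges the two lower and the two upper categories, which helps explain why its accuracy (0.87) is higher than that of the four-category task (0.69).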
Age estimation yielded a mean absolute error of 7 years and a correlation coefficient of 0.88; accuracy was higher in pediatric and middle-aged patients and lower in elderly adults. Sex classification achieved an AUROC of 0.86.
Mechanistic Considerations
The researchers contextualized their findings within established sleep-disease relationships. For mortality prediction, they noted that SleepFM likely integrates multifactorial contributors, capturing respiratory events, sleep fragmentation, arousal burden, and sleep efficiency, along with markers of cardiovascular, metabolic, and other diseases.
Regarding neurodegenerative disease prediction, the researchers observed that sleep disorders are strongly associated with preclinical Alzheimer disease, including abnormalities in non-REM sleep such as reduced slow-wave activity, REM sleep disturbances, and decreased spindle activity. For Parkinson disease, they highlighted that REM sleep behavior disorder, characterized by REM sleep without atonia and abnormalities in brain activity and electrocardiography patterns, frequently precedes clinical diagnosis.
Limitations
The researchers acknowledged several constraints. The data set consists primarily of patients referred for sleep studies due to suspected sleep disorders or other medical conditions requiring overnight monitoring, creating selection bias. Performance showed some degradation in temporal test sets, highlighting the challenge of maintaining accuracy over time as clinical practices and patient populations evolve.
Additionally, only a subset of the full 1,041 conditions could be assessed on the SHHS external validation set due to limited diagnostic overlap with Stanford data. Sleep apnea analysis was limited to binary and four-class classification; event detection and apnea-hypopnea index regression were not explored.
The researchers concluded that foundation models can learn the language of sleep from multimodal sleep recordings, enabling scalable, label-efficient analysis and disease prediction, and suggested that as wearable sleep technologies continue to advance, models such as SleepFM may offer opportunities for noninvasive, real-time health monitoring.
Disclosures can be found in the published study.
Source: Nature Medicine.