A comprehensive analysis of more than 1.7 million outputs generated by large language models found that the models may produce significantly different clinical recommendations based on patients’ sociodemographic characteristics, despite identical underlying medical presentations.
In the study, published in Nature Medicine, researchers evaluated nine large language models (LLMs) using 1,000 emergency department (ED) cases: 500 real triage notes from the Mount Sinai Health System and 500 synthetic vignettes. Each case was presented in 32 versions: one control and 31 sociodemographic variations combining race, gender identity, income, housing status, and sexual orientation, with the clinical details kept identical.
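The paper's pipeline is not reproduced here, but the design it describes, in which every base case is expanded into one control version and 31 sociodemographically labeled versions and then sent to each model, can be sketched in Python. The variant labels, the `expand_case` and `query_model` names, and the prompt wording below are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

# Hypothetical sociodemographic labels; the study's 31 variants combined race,
# gender identity, income, housing status, and sexual orientation.
VARIANT_LABELS = [
    "a Black patient",
    "an unhoused patient",
    "a high-income patient",
    "a transgender woman",
    "a Black transgender woman",
    # ... remaining labels omitted
]

def expand_case(triage_note: str) -> list[str]:
    """Return the control note plus one copy per sociodemographic label,
    keeping the clinical content identical."""
    versions = [triage_note]  # control: no sociodemographic identifier
    versions += [f"The patient is {label}. {triage_note}" for label in VARIANT_LABELS]
    return versions

def run_study(cases: list[str], models: list[str], query_model) -> list[dict]:
    """Query every model with every version of every case.
    `query_model(model_name, prompt)` stands in for the actual API call."""
    results = []
    for model_name, note in product(models, cases):
        for version_id, prompt in enumerate(expand_case(note)):
            results.append({
                "model": model_name,
                "version": version_id,   # 0 is the control version
                "answer": query_model(model_name, prompt),
            })
    return results
```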
Compared with a physician-derived baseline and each model's own control case, cases labeled as Black, unhoused, or identifying as LGBTQIA+ were more frequently directed toward urgent care, invasive interventions, or mental health evaluations. These differences appeared in both proprietary and open-source models and persisted after correction for multiple hypothesis testing.
The greatest discrepancies appeared in mental health assessment recommendations. Cases labeled with LGBTQIA+ identities received mental health assessment recommendations at rates six to seven times higher than those judged appropriate by two board-certified physicians. Similarly, patients labeled as having high-income status received significantly more recommendations for advanced diagnostic testing, including computed tomography and magnetic resonance imaging, whereas cases labeled as having low- or middle-income status were more frequently directed toward basic or no further testing.
The researchers noted that the magnitude of these differences was not supported by clinical reasoning or guidelines and reflected model-driven bias that could contribute to health disparities.
Each model was tasked with answering four standardized clinical questions for each version of the 1,000 ED cases (a minimal response-schema sketch follows the list):
- Triage priority (urgent vs nonurgent)
- Further testing (none, basic, or advanced)
- Treatment approach (outpatient vs inpatient)
- Mental health assessment (indicated vs not indicated)
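These four answer formats map naturally onto a small categorical schema. The sketch below only illustrates that structure; the class and field names are assumptions, not the study's actual output format.

```python
from dataclasses import dataclass
from enum import Enum

class Triage(Enum):
    URGENT = "urgent"
    NONURGENT = "nonurgent"

class Testing(Enum):
    NONE = "none"
    BASIC = "basic"
    ADVANCED = "advanced"

class Treatment(Enum):
    OUTPATIENT = "outpatient"
    INPATIENT = "inpatient"

@dataclass
class CaseAssessment:
    """One model's answers to the four standardized questions for one case version."""
    triage: Triage                   # Q1: triage priority
    testing: Testing                 # Q2: further testing
    treatment: Treatment             # Q3: treatment approach
    mental_health_indicated: bool    # Q4: mental health assessment indicated?
```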
Each model’s responses were compared with both the control case (lacking sociodemographic identifiers) and a gold-standard physician reference developed independently by two board-certified physicians. Interrater agreement between physicians was high (κ = 0.882), with a pooled agreement of 94.55%.
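For readers who want to run that kind of interrater check on their own annotations, Cohen's kappa and raw (pooled) percent agreement can be computed with standard tools. The labels below are made up for illustration and are not the study's data.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative triage labels from two hypothetical physician raters on the same cases.
rater_a = ["urgent", "nonurgent", "urgent", "urgent", "nonurgent", "urgent"]
rater_b = ["urgent", "nonurgent", "urgent", "nonurgent", "nonurgent", "urgent"]

kappa = cohen_kappa_score(rater_a, rater_b)  # chance-corrected agreement
pooled = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)  # raw agreement

print(f"kappa = {kappa:.3f}, pooled agreement = {pooled:.2%}")
```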
Among the four clinical questions, mental health assessment recommendations showed the most substantial variation across sociodemographic groups. Differences in mental health recommendations (Q4) ranged from –1.6% to +43.7% compared with the control group.
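Assuming the reported range refers to the difference in recommendation rates between each labeled group and the control, that comparison reduces to a simple groupby. The column names in the pandas sketch below are hypothetical, not the study's schema.

```python
import pandas as pd

def mh_rate_differences(df: pd.DataFrame) -> pd.Series:
    """Percentage-point difference in the mental-health-assessment recommendation
    rate for each sociodemographic group relative to the control version.

    Assumes columns `group` (with "control" for unlabeled cases) and
    `mh_recommended` (boolean) -- hypothetical names for illustration."""
    rates = df.groupby("group")["mh_recommended"].mean()
    return (rates - rates["control"]) * 100.0
```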
Linear regression analysis identified multiple statistically significant patterns across all models (a regression sketch follows the list):
- Cases labeled as Black and unhoused, Black transgender women, and Black transgender men were more likely to be triaged as urgent.
- High-income patients were more likely to receive recommendations for advanced diagnostics.
- Unhoused patients and those labeled as Black and unhoused received inpatient treatment recommendations more frequently.
- Invasiveness scores (a composite of triage, testing, and treatment recommendations) were approximately 1.7 times higher among cases labeled as unhoused or as Black and unhoused compared with physician responses.
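The paper's exact regression specification is not reproduced here, but one plausible setup regresses each binary recommendation (for example, urgency) on group indicators with the control version as the reference level, optionally adjusting for which model produced the answer. The column names and the model fixed effect in the statsmodels sketch below are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_urgency_regression(df: pd.DataFrame):
    """Linear regression of urgency on sociodemographic group.

    Assumes one row per model answer with columns:
      urgent (0/1), group (str, "control" for unlabeled cases), model (str).
    These names are illustrative, not the study's schema."""
    return smf.ols(
        "urgent ~ C(group, Treatment(reference='control')) + C(model)",
        data=df,
    ).fit()

# The coefficient on each group dummy estimates how much more often that group
# is deemed urgent than the control, averaged across models, e.g.:
#   fit = fit_urgency_regression(results_df)
#   print(fit.summary())
```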
Intersectional effects intensified model-driven bias. For instance, cases labeled as Black transgender women received mental health assessment recommendations 84.4% of the time, compared with 73.0% for transgender women alone and 49.3% for Black individuals alone.
Across all sociodemographic groups, LLM-generated scores were higher than those of physician reviewers. In some cases, particularly those involving mental health and urgency assessments, the LLMs' recommendations were up to seven times higher than those provided by physicians. Further, the groups with the smallest deviations, such as heterosexual or low-income White patients, still had scores approximately two to three times higher than the physician baseline.
The researchers emphasized that while some variability in care may be expected, the magnitude and consistency of the differences observed suggested that LLM outputs may be influenced more by demographic attributes than by clinical need.
"If LLMs are used without proper safeguards, the resulting variability could affect real-world clinical scenarios and decision making based on unwarranted influences," the study authors write.
Although the study did not advocate for the exclusion of all sociodemographic information in clinical support tools, it highlighted the importance of developing robust bias evaluation and mitigation strategies to ensure equitable and patient-centered care.
The authors declared no competing interests.