Investigators examined how nonclinical features—such as punctuation, tone, formatting, and gender cues—influence AI-generated medical advice. They found that minor changes in a message’s style, including added whitespace, emotional language, or altered pronouns, could significantly alter a model’s diagnosis or treatment recommendation.
Researchers tested four large language models (LLMs), including GPT-4 and Palmyra-Med, across more than 8,000 patient scenarios using three clinical datasets. They introduced nine types of semantic-preserving text changes that varied surface-level features but retained clinical meaning. These included gender-swapping, inserting typos, and modifying tone to sound more anxious or dramatic.
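The perturbation code itself is not reproduced in this summary; a minimal sketch of what such semantic-preserving edits might look like in Python is shown below. Function names, word lists, and probabilities are illustrative assumptions, not the study's implementation.

```python
import random
import re

# Hypothetical sketches of semantic-preserving perturbations of the kind
# described above; word lists and probabilities are illustrative only.

def add_whitespace(msg: str, p: float = 0.15) -> str:
    """Insert extra spaces between random words without changing meaning."""
    return " ".join(w + ("  " if random.random() < p else "") for w in msg.split(" "))

def add_typos(msg: str, p: float = 0.05) -> str:
    """Swap adjacent characters in a few words to mimic typing errors."""
    def typo(word: str) -> str:
        if len(word) > 3 and random.random() < p:
            i = random.randrange(len(word) - 1)
            return word[:i] + word[i + 1] + word[i] + word[i + 2:]
        return word
    return " ".join(typo(w) for w in msg.split())

def swap_gender_terms(msg: str) -> str:
    """Swap explicit gender markers (a crude word-level illustration)."""
    mapping = {"she": "he", "he": "she", "her": "his", "his": "her",
               "woman": "man", "man": "woman"}
    return re.sub(r"\b\w+\b",
                  lambda m: mapping.get(m.group(0).lower(), m.group(0)), msg)

def add_colorful_tone(msg: str) -> str:
    """Prepend dramatic, emotional framing while keeping the clinical content."""
    return "I'm really scared and this feels unbearable. " + msg

PERTURBATIONS = [add_whitespace, add_typos, swap_gender_terms, add_colorful_tone]
```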
Across all models and datasets, investigators observed an average 7% to 9% increase in treatment recommendation changes after perturbation (P < .005). “Colorful” language had the most pronounced effect, prompting a 12.9% shift in care suggestions.
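The evaluation procedure is only summarized above; one plausible way to quantify such shifts is to compare each model's recommendation before and after perturbation and report the fraction of cases that changed, roughly as sketched below. The `model.recommend` and `case.message` interfaces are hypothetical, not taken from the study.

```python
def recommendation_change_rate(cases, model, perturb) -> float:
    """Fraction of cases whose recommended care level changes after a
    semantic-preserving perturbation (hypothetical interfaces)."""
    changed = 0
    for case in cases:
        baseline = model.recommend(case.message)             # e.g. "self-manage" or "see clinician"
        perturbed = model.recommend(perturb(case.message))
        if perturbed != baseline:
            changed += 1
    return changed / len(cases)
```

Averaging this rate over models, datasets, and perturbation types would yield a single figure comparable to the percentages quoted above.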
Models frequently recommended lower levels of care after perturbation. For example, patients originally advised to seek clinical attention were sometimes redirected to self-manage at home. Across all perturbations, 4.55% of treatment recommendations were downgraded, and 4.12% of those downgrades were incorrect based on expert labels.
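A downgrade can be counted whenever the perturbed recommendation falls below the original on an ordered scale of care intensity; the sketch below illustrates this, assuming a hypothetical three-level ordering and expert labels attached to each case. None of these names come from the study.

```python
# An assumed ordering of care levels from least to most intensive; the study's
# exact label set is not given in this summary.
CARE_LEVELS = {"self-manage": 0, "see clinician": 1, "emergency care": 2}

def downgrade_stats(cases, model, perturb):
    """Share of all recommendations downgraded after perturbation, and the share
    of those downgrades that fall below the expert-labeled care level
    (model.recommend, case.message, and case.expert_label are hypothetical)."""
    downgraded = incorrect = 0
    for case in cases:
        before = CARE_LEVELS[model.recommend(case.message)]
        after = CARE_LEVELS[model.recommend(perturb(case.message))]
        if after < before:
            downgraded += 1
            if after < CARE_LEVELS[case.expert_label]:
                incorrect += 1
    n = len(cases)
    return downgraded / n, (incorrect / downgraded if downgraded else 0.0)
```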
Certain subgroups were more vulnerable. Patients with lower digital literacy or limited English proficiency—simulated by poor grammar, typos, or all-caps messages—were more likely to receive downgraded care. Even the insertion of random whitespace alone reduced the likelihood of follow-up care recommendations.
“Female patients were not recommended to visit a clinician when they actually should have at a higher rate than their male counterparts.” — Gourabathina et al, FAccT ’25
Disparities were also observed by gender. Female patients and those using language patterns associated with women were more likely to be told to delay or forgo clinical care. Even after explicit gender markers were removed, models inferred gender from writing style and still altered treatment accordingly.
In conversational settings that mimic patient-AI interactions, clinical accuracy declined by approximately 7% across all seven perturbations. The largest drop, 9.1%, occurred in multiturn exchanges with whitespace-modified input.
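The conversational protocol is described only at a high level here; a rough sketch of a two-turn exchange, assuming an OpenAI-style chat interface with placeholder prompts, might look like this.

```python
from openai import OpenAI  # assumes an OpenAI-style chat API; not necessarily the study's setup

client = OpenAI()

def multiturn_recommendation(message: str, followup: str, model: str = "gpt-4") -> str:
    """Query the model over two turns to mimic a patient-AI exchange.
    The system prompt and follow-up question are illustrative placeholders."""
    history = [
        {"role": "system", "content": "You are a medical triage assistant."},
        {"role": "user", "content": message},
    ]
    first = client.chat.completions.create(model=model, messages=history)
    history.append({"role": "assistant", "content": first.choices[0].message.content})
    history.append({"role": "user", "content": followup})
    second = client.chat.completions.create(model=model, messages=history)
    return second.choices[0].message.content
```

Scoring the returned recommendations against expert labels for both original and whitespace-perturbed messages would support the kind of single-turn versus multiturn comparison described above.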
Although LLMs show promise in health care, the findings demonstrate their sensitivity to superficial textual changes. The authors recommend fairness audits and real-world simulation tests to ensure AI tools remain accurate and equitable—regardless of patient tone, writing style, or demographics.
No explicit conflict of interest disclosures were provided by the authors. Funding support was acknowledged from multiple research institutions and philanthropic organizations, including Stanford HAI, NSF, and the Chan Zuckerberg Initiative.
Source: FAccT ’25