AI Medical Triage Faces Scrutiny Over Bias in Patient Communication
A new MIT study indicates that artificial intelligence models used to triage patient messages may be more sensitive to writing style than previously understood, potentially affecting patient care. The finding raises fairness concerns, particularly for vulnerable groups, as these models are increasingly deployed in clinical settings.
AI’s Sensitivity to Language
Researchers at MIT have found that large language models (LLMs) can be significantly swayed by stylistic quirks in patient messages, including minor alterations such as typos, informal language, and even variations in spacing. The study, presented at the ACM Conference on Fairness, Accountability, and Transparency, examined the impact of these changes across thousands of clinical scenarios.
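To make the idea concrete, the sketch below shows what perturbations of this kind might look like in Python. The functions and the sample message are invented for illustration and are not drawn from the study's own materials.

```python
import random

# Hypothetical illustration of the stylistic perturbations the study describes:
# a swapped-character typo, informal substitutions, and extra whitespace.

def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def make_informal(text: str) -> str:
    """Apply a few colloquial substitutions and drop capitalization."""
    return text.replace("I am", "i'm").replace("cannot", "can't").lower()

def add_extra_whitespace(text: str, rng: random.Random) -> str:
    """Double the space after a few randomly chosen words."""
    words = text.split(" ")
    for i in rng.sample(range(len(words) - 1), k=min(3, len(words) - 1)):
        words[i] += " "
    return " ".join(words)

rng = random.Random(0)
message = "I am having chest pain and I cannot catch my breath since this morning."
print(add_extra_whitespace(make_informal(add_typo(message, rng)), rng))
```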
“This is strong evidence that models must be audited before use in health care — which is a setting where they are already in use,”
—Marzyeh Ghassemi, Ph.D., Senior Author, MIT
The study revealed that the LLMs were 7–9% more likely to recommend self-management over medical care when messages were modified, a pattern that held across the tested models, including GPT-4. As a result, patients with health anxiety or those who are not native English speakers could be incorrectly advised to stay home when medical attention is needed. For context, the CDC reports that approximately 1 in 5 adults in the U.S. experience some form of mental illness each year (CDC, 2024).
How the Study Worked
The research team used a three-step process to assess the effects of stylistic changes. First, they altered patient messages with minor but realistic edits. Next, they ran both the original and modified messages through the LLMs and collected the resulting treatment recommendations. Finally, they compared the responses, looking for discrepancies, with human-validated answers serving as the benchmark for accuracy.
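In outline, that audit loop could be implemented as follows. This is a minimal sketch rather than the team's actual pipeline: `query_model` stands in for whichever LLM is being tested, and the labels "visit" and "self-manage" are simplified placeholders for the study's recommendation categories.

```python
from collections import Counter

def audit(cases, query_model):
    """
    cases: iterable of (original_msg, perturbed_msg, gold_label) tuples,
           where gold_label is the human-validated recommendation.
    query_model: callable mapping a message to a triage label such as
                 'visit' or 'self-manage' (stands in for the LLM call).
    """
    flips = 0
    errors = Counter()
    for original, perturbed, gold in cases:
        rec_orig = query_model(original)
        rec_pert = query_model(perturbed)
        if rec_orig != rec_pert:
            flips += 1                       # stylistic change flipped the advice
        if rec_pert == "self-manage" and gold == "visit":
            errors["reduced_care"] += 1      # the harmful direction the study flags
        elif rec_pert == "visit" and gold == "self-manage":
            errors["over_triage"] += 1
    return flips, errors

# Toy usage with a stand-in "model" that keys on a single phrase:
toy_model = lambda msg: "visit" if "chest pain" in msg.lower() else "self-manage"
print(audit([("Chest pain since 8am", "chest  pian since 8am", "visit")], toy_model))
```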
The study also found that the LLMs were more likely to scale back care recommendations for female patients than for male patients, even when gender markers were removed from the messages. The inclusion of extra white space amplified these discrepancies, increasing reduced-care errors for female patients by more than 5%.
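One way an audit could quantify such a gap, shown below purely for illustration, is to compute the reduced-care error rate for each subgroup and compare them. The field names and the handful of records are invented, not the study's data.

```python
def reduced_care_rate(records):
    """Share of cases where self-management was advised but a visit was needed."""
    bad = sum(1 for r in records
              if r["recommendation"] == "self-manage" and r["gold"] == "visit")
    return bad / len(records) if records else 0.0

# Hypothetical per-group model outputs paired with human-validated labels.
by_group = {
    "female": [{"recommendation": "self-manage", "gold": "visit"},
               {"recommendation": "visit", "gold": "visit"}],
    "male":   [{"recommendation": "visit", "gold": "visit"},
               {"recommendation": "visit", "gold": "visit"}],
}
rates = {group: reduced_care_rate(records) for group, records in by_group.items()}
print(rates, "gap:", rates["female"] - rates["male"])
```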
Implications for Healthcare
The research highlights the “brittleness” of AI medical reasoning: slight, non-clinical variations in how a patient writes can shift the care a model recommends. Human clinicians, by contrast, were not influenced by the same alterations, underscoring how fragile LLMs remain in this context.
The study’s findings support more rigorous auditing and subgroup testing before deploying LLMs in critical healthcare settings. Researchers like Abinitha Gourabathina emphasize the importance of understanding the direction of errors: “Not recommending visitation when you should is much more harmful than doing the opposite.”