AI Gender Bias in Healthcare: GPT-5.4 and Claude Risks

AI models have long evolved beyond simple text generators; they are now actively positioning themselves as the first line of medical triage. However, a new study by Kee Kiat Yeong reveals a systemic flaw: neural networks process identical neurological symptoms differently depending on the patient's gender. Given the same history—persistent headaches, blurred vision, and morning nausea—Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini systematically downgrade the urgency of hospitalization for young women. This is not merely an algorithmic bug, but a deep cognitive error where the AI clings to gender-associated diagnoses to justify less intensive care.

The Mechanism of Diagnostic Substitution

Yeong's methodology involved 630 tests using standardized symptom profiles across various age groups and genders. The results are alarming: young women are referred to emergency departments significantly less often than men presenting with identical complaints. Gemini 3.5 Flash sent 0% of women to the ER compared to 23.3% of men. Claude Sonnet 4.6 demonstrated an even more egregious gap, referring only 6.7% of women versus 96.7% of men. For GPT-5.4-mini, the figures were 6.7% and 66.7%, respectively.

This research is motivated by a well-known phenomenon in clinical medicine: neurological and cardiac symptoms in women are more frequently dismissed as benign or psychosomatic.

Researchers call this "diagnostic substitution." The models preferred to diagnose young women with idiopathic intracranial hypertension (IIH)—a condition statistically linked to women of childbearing age. Meanwhile, men were diagnosed with general increased intracranial pressure, which implies potential brain tumors. Since IIH is considered less life-threatening in the immediate term, the AI referred female patients for routine appointments, ignoring its own severity ratings of 7 to 9 out of 10.

Statistical Traps and Technical Barriers

This bias is driven by epidemiological prior probabilities rather than "malicious intent" in the code. This is evidenced by the fact that the referral gap disappears by age 65—the age when IIH statistics decline. Current alignment methods are unable to scrub the deep-seated stereotypes embedded in medical datasets. The models simply overvalue statistical probability at the expense of clinical safety.

AI triage systems must decouple urgency assessment from probabilistic diagnostic forecasting.

For insurance companies and clinics, using "raw" neural networks for primary patient sorting is becoming a massive legal liability. The study proves that clinical neural networks replicate the worst human biases by using statistical crutches. If a system cannot distinguish between a typical case and a critical risk, it remains a dangerous tool for autonomous decision-making.

Integrating GPT-5.4-mini or Claude Sonnet 4.6 into medical workflows requires a fundamental architectural shift. Developers must acknowledge that general RLHF does not cure gender bias. Emergency assessment must be isolated from diagnostic probability. For the industry, this is a stark warning: "statistically accurate" models can be "clinically criminal" if they allow demographic data to override SOS signals. Yeong’s data serves as a necessary benchmark for auditing systems before they make their first fatal mistake on a real patient.

Source: arXiv cs.AI →

Rate this material

★ ★ ★ ★ ★

AI in HealthcareAI SafetyLarge Language ModelsOpenAIAnthropic

Gender Bias in Medical AI: Why GPT and Claude Underestimate Risks for Women

The Mechanism of Diagnostic Substitution

Statistical Traps and Technical Barriers