# Reasoning LLMs Perpetuate Racial and Gender Stereotypes

> Source: <https://letsdatascience.com/news/reasoning-llms-perpetuate-racial-and-gender-stereotypes-7c610a38>
> Published: 2026-05-28 22:06:21.503079+00:00

A study published in the Journal of Medical Internet Research (JMIR) reports that its evaluation of **36,000** clinical vignettes found the reasoning large language models o3-mini and DeepSeek-R1 frequently reproduced racial and gender disease stereotypes. The paper cites patterns similar to prior work that flagged bias in GPT-4, including overrepresentation of Black patients in stereotypical conditions, and frames the result as evidence that improved reasoning capability alone does not eliminate representational unfairness. Editorial analysis: For clinicians and ML practitioners, the findings underscore that benchmark gains on reasoning tasks do not automatically translate into fairness improvements, and that targeted mitigation and auditing remain necessary.

### What happened

The study, published in the **Journal of Medical Internet Research**, evaluated **36,000** clinical vignettes and found that the reasoning models o3-mini and DeepSeek-R1 frequently reproduced racial and gender disease stereotypes. The paper notes that these results echo earlier evaluations of GPT-4, which identified similar bias patterns, including overrepresentation of Black patients in stereotypical conditions. The JMIR article frames the core finding as evidence that enhancements in reasoning do not inherently resolve representational harms in medical contexts.

### Technical details

The JMIR evaluation ran structured clinical vignettes at scale: the paper reports **36,000** vignettes and names o3-mini and DeepSeek-R1 as the tested models. The manuscript compares model outputs across race and gender cues embedded in the vignettes and quantifies each model's propensity to associate demographic attributes with specific diagnoses. The authors describe their methodology as a follow-on to earlier vignette-based bias assessments published in Lancet Digital Health and related literature.
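One way to picture this kind of measurement is the minimal Python sketch below. It is not the paper's method: the metric (a group's observed share of a condition divided by a reference share), the function name `representation_ratio`, and the placeholder labels `condition_A`, `group_1`, and `group_2` are all assumptions made for illustration, and the numbers are made up.

```python
from collections import Counter


def representation_ratio(generated, reference_share):
    """Compare how often each group appears for a condition against a reference.

    generated: list of (condition, group) tuples extracted from model outputs.
    reference_share: dict mapping condition -> {group: expected share in [0, 1]}.
    Returns a dict mapping (condition, group) -> observed share / expected share.
    A ratio above 1 means the group is overrepresented relative to the reference.
    """
    totals = Counter(condition for condition, _ in generated)
    counts = Counter(generated)
    ratios = {}
    for condition, groups in reference_share.items():
        for group, expected in groups.items():
            # Skip cells where the ratio is undefined.
            if totals[condition] == 0 or expected == 0:
                continue
            observed = counts[(condition, group)] / totals[condition]
            ratios[(condition, group)] = observed / expected
    return ratios


# Hypothetical data: 10 generated vignettes for one placeholder condition.
example_outputs = [("condition_A", "group_1")] * 8 + [("condition_A", "group_2")] * 2
# Hypothetical reference shares (for example, from epidemiological data).
example_reference = {"condition_A": {"group_1": 0.4, "group_2": 0.6}}

example_ratios = representation_ratio(example_outputs, example_reference)
for key, value in sorted(example_ratios.items()):
    print(key, round(value, 2))
```

In this toy input, `group_1` appears in 80% of outputs against a 40% reference share, so its ratio is above 1. A real audit would also need confidence intervals and a defensible choice of reference data, neither of which this sketch addresses.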

### Industry context

Editorial analysis: Industry reporting and prior literature show a consistent pattern in which improvements on reasoning benchmarks do not automatically reduce the social biases encoded in large language models. Observers studying clinical deployments note that representational fairness requires targeted dataset curation, evaluation slices, and mitigation techniques that are distinct from general-purpose reasoning evaluation.

### Implications for practitioners

Editorial analysis: For ML engineers and clinical data scientists, the study indicates the need to include demographic-sliced evaluations when validating models for health applications and to treat reasoning capability and fairness as separate validation axes. Common mitigation approaches to consider, based on the broader literature, include counterfactual data augmentation, calibrated postprocessing, and domain-specific adversarial testing, though the JMIR paper focuses on measurement rather than remediation.
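A demographic-sliced evaluation of the kind suggested above could look roughly like the following Python sketch. It assumes a hypothetical record format (dicts with `group`, `predicted`, and `expected` keys) and uses plain accuracy with a maximum between-group gap as the summary; neither the format nor the metric comes from the JMIR paper.

```python
from collections import defaultdict


def sliced_accuracy(records):
    """Return accuracy per demographic group.

    records: iterable of dicts with keys 'group', 'predicted', 'expected'
    (a hypothetical format chosen for this sketch).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record["group"]] += 1
        hits[record["group"]] += int(record["predicted"] == record["expected"])
    return {group: hits[group] / totals[group] for group in totals}


def max_gap(per_group):
    """Largest difference in accuracy between any two groups (0.0 if empty)."""
    values = list(per_group.values())
    return max(values) - min(values) if values else 0.0


# Made-up records for two placeholder groups.
example_records = [
    {"group": "group_1", "predicted": "dx_a", "expected": "dx_a"},
    {"group": "group_1", "predicted": "dx_b", "expected": "dx_a"},
    {"group": "group_2", "predicted": "dx_a", "expected": "dx_a"},
]

per_group = sliced_accuracy(example_records)
print(per_group)
print("max gap:", max_gap(per_group))
```

Reporting the per-group figures alongside the aggregate keeps fairness visible as its own validation axis, so that a strong overall score cannot hide a weak slice.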

### What to watch

Editorial analysis: Observers should watch for follow-up work that tests mitigation strategies on the same vignette suite and for independent audits of reasoning models in clinical workflows. Tracking whether vendors publish demographic-sliced performance reports or open benchmarking datasets derived from this study will be important for reproducibility and regulatory assessment.

## Scoring Rationale

The paper is notable for its scale and domain relevance: a large-scale (36,000-vignette) evaluation of clinical scenarios whose findings point to practical risks for healthcare deployment. It is important for practitioners but is not a paradigm-shifting technical advance.

