# Study Tests Patient Cognitive Bias in LLM Consultations

> Source: <https://letsdatascience.com/news/study-tests-patient-cognitive-bias-in-llm-consultations-1d1d1a86>
> Published: 2026-06-11 21:54:14.388164+00:00

A simulation-based comparative study published in the **Journal of Medical Internet Research** (**JMIR**) finds that patient cognitive bias reduces LLM diagnostic accuracy by **10-40 percentage points** (P < .001) across six models tested on **1,273 MedQA-USMLE cases**. Researchers Yi Zuo, Qifeng Wan, and Shalong Wang developed a simulated patient agent that generated confirmation-biased and unbiased consultations, finding that errors frequently reflected user misconceptions -- the bias-influenced error proportion (BIEP) exceeded **33%**. Neither prompt engineering nor temperature adjustments provided consistent resilience. A dual-system framework pairing a foundation model (System 1) with **o1-Mini** as a deliberative reasoning layer (System 2) recovered **10-39 percentage points** of lost accuracy (P < .001). The findings establish user cognitive bias as a newly quantified behavioral risk in patient-facing AI tools, with implications for clinical deployment standards and evaluation benchmarks.

### What the study found

A simulation-based comparative study published in the **Journal of Medical Internet Research** (JMIR) reports that patient cognitive bias substantially degrades LLM diagnostic performance in health consultations. Researchers Yi Zuo, Qifeng Wan, and Shalong Wang developed a simulated patient agent to generate unbiased and confirmation-biased consultations using **1,273 MedQA-USMLE cases**, then evaluated **six LLMs of varying capacities** through multi-turn dialogues. The primary finding: user cognitive bias reduced diagnostic accuracy by **10-40 percentage points** (P < .001), with smaller models occasionally performing near chance level. A secondary metric, the bias-influenced error proportion (BIEP), exceeded **33%** -- meaning that more than a third of model errors matched the user's misconception rather than arising from independent model reasoning.

### Methods

The study compared two consultation conditions: unbiased consultations, and confirmation-biased consultations in which the simulated patient agent steered the dialogue toward a preconceived diagnosis. The authors measured three outcomes: diagnostic accuracy, bias-induced accuracy decline (BIAD, the accuracy lost under bias), and bias-influenced error proportion (BIEP, the fraction of errors aligned with user misconceptions). They then tested four prompt-based mitigation strategies, four temperature settings, and a dual-system framework inspired by dual-process cognitive theory -- System 1 being a standard foundation model and System 2 being o1-Mini as a deliberative reasoning layer.
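To make the two bias metrics concrete, here is a minimal Python sketch of how BIAD and BIEP could be computed from paired consultation results. The exact formulas are an assumption based on the descriptions above (accuracy lost under bias, and the share of biased-condition errors that match the patient's belief); the `CaseResult` record, its field names, and the toy data are hypothetical and do not come from the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseResult:
    """One case answered under both conditions (hypothetical record layout)."""
    correct_answer: str
    patient_belief: str   # the incorrect diagnosis the simulated patient pushes
    unbiased_answer: str  # model answer after the unbiased consultation
    biased_answer: str    # model answer after the confirmation-biased consultation


def accuracy(results: List[CaseResult], field: str) -> float:
    """Fraction of cases where the answer stored in `field` is correct."""
    hits = sum(getattr(r, field) == r.correct_answer for r in results)
    return hits / len(results)


def biad(results: List[CaseResult]) -> float:
    """Assumed BIAD: unbiased minus biased accuracy, in percentage points."""
    return 100 * (accuracy(results, "unbiased_answer")
                  - accuracy(results, "biased_answer"))


def biep(results: List[CaseResult]) -> float:
    """Assumed BIEP: share of biased-condition errors equal to the patient's belief."""
    errors = [r for r in results if r.biased_answer != r.correct_answer]
    if not errors:
        return 0.0
    aligned = sum(r.biased_answer == r.patient_belief for r in errors)
    return aligned / len(errors)


# Toy data for illustration only; not taken from the study.
TOY_RESULTS = [
    CaseResult("A", "B", "A", "B"),  # bias flips a correct answer to the belief
    CaseResult("C", "D", "C", "C"),  # model resists the bias
    CaseResult("A", "C", "A", "D"),  # error under bias, but not the patient's belief
    CaseResult("B", "A", "C", "A"),  # wrong in both conditions; biased error matches belief
]

print(f"BIAD: {biad(TOY_RESULTS):.1f} percentage points")
print(f"BIEP: {biep(TOY_RESULTS):.2f}")
```

Separating the two metrics matters because a model can lose accuracy under bias without adopting the patient's belief; BIEP isolates the errors that actually echo the misconception.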

### Key results

Prompt engineering and temperature adjustments produced limited or inconsistent improvements -- neither reliably counteracted patient confirmation bias. In contrast, the dual-system framework increased accuracy by **10-39 percentage points** and recovered most or all of the bias-driven performance gap (P < .001). This suggests architectural interventions, rather than prompting alone, are needed for bias-resilient clinical AI.
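The article does not describe how the two systems are wired together, so the following is only a rough Python sketch of one plausible arrangement: a fast model drafts a diagnosis, and a deliberative model reviews the dialogue together with that draft before the final answer is returned. The function `dual_system_answer`, the `FINDING:`/`BELIEF:` turn prefixes, and both stub models are hypothetical stand-ins, not the authors' implementation or any real model API.

```python
from typing import Callable, List

Dialogue = List[str]


def dual_system_answer(
    dialogue: Dialogue,
    system1: Callable[[Dialogue], str],
    system2: Callable[[Dialogue, str], str],
) -> str:
    """Draft with the fast model, then let the deliberative model review it."""
    draft = system1(dialogue)
    return system2(dialogue, draft)


def stub_system1(dialogue: Dialogue) -> str:
    """Toy stand-in for a foundation model that defers to the patient's belief."""
    for turn in dialogue:
        if turn.startswith("BELIEF:"):
            return turn.split(":", 1)[1].strip()
    return "unknown"


def stub_system2(dialogue: Dialogue, draft: str) -> str:
    """Toy stand-in for a reasoning layer that re-reads only reported findings."""
    findings = " ".join(
        turn.split(":", 1)[1] for turn in dialogue if turn.startswith("FINDING:")
    ).lower()
    # Hard-coded toy rule, for illustration only; not clinical logic.
    if "crushing chest pain" in findings:
        return "myocardial infarction"
    return draft


# Hypothetical dialogue: one reported finding plus the patient's own theory.
DIALOGUE = [
    "FINDING: crushing chest pain radiating to the left arm",
    "BELIEF: acid reflux",
]

print("System 1 alone:", stub_system1(DIALOGUE))
print("Dual system:   ", dual_system_answer(DIALOGUE, stub_system1, stub_system2))
```

Passing the models in as callables keeps the orchestration independent of any particular provider, so the stubs could be swapped for real model calls without changing `dual_system_answer`.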

### Why it matters

For practitioners building or evaluating patient-facing AI tools, the study introduces a concrete and previously underspecified failure mode: users themselves are a source of reasoning error. Standard benchmarks such as MedQA do not capture this dimension; the study's BIAD and BIEP metrics provide a practical evaluation vocabulary. The dual-system result offers a deployment path -- pairing a fast response model with a slower deliberative reasoning model may be a scalable safeguard for higher-stakes medical applications.

## Scoring Rationale

Solid niche research with quantitatively significant findings: a 10-40 percentage-point accuracy drop under patient bias is meaningful and is not captured by existing benchmarks. The dual-system mitigation result has practical deployment relevance. The score reflects a well-executed domain-specific study rather than a paradigm-shifting result.

