# AI Chatbots Fail Medical Questions One in Five Times, Study Reveals

> Source: <https://www.gadgetreview.com/ai-chatbots-fail-medical-questions-one-in-five-times-study-reveals>
> Published: 2026-05-29 18:42:35+00:00

Even [ChatGPT’s](https://www.gadgetreview.com/man-uses-chatgpt-to-design-cancer-vaccine-that-saved-his-dogs-life) latest model gets health answers wrong **15%** of the time. That failure rate jumps to **50%** for some AI systems when you ask about symptoms, treatments, or medical decisions. A new Penn State study exposes the gap between AI’s confident responses and actual medical accuracy—a reality check for anyone who’s ever typed “is this rash serious?” into a chatbot.

## Top Chatbots Still Miss Critical Health Details

*Board-certified physicians found significant accuracy gaps across all major AI models when testing 212 real health questions.*

[Researchers](https://www.news-medical.net/news/20260528/AI-chatbots-answer-health-questions-with-moderate-overall-accuracy.aspx) tested four popular [chatbots](https://www.gadgetreview.com/chatgpts-mysterious-name-block)—**ChatGPT-4o**, **ChatGPT-3.5**, **Gemini-1.5 Pro**, and **Llama3-8b**—against health questions gathered through a university competition format. Nine board-certified physicians judged the responses, finding that roughly **24%** fell short of medical standards across all models tested.

ChatGPT-4o performed best with **84.6%** validity, while Llama3-8b answered only about half of its questions correctly. The accuracy problem isn’t limited to older models: even OpenAI’s flagship makes clinically meaningful mistakes that could affect real-world health decisions.

## Medical Specialty Determines AI Performance

*Obstetrics questions scored highest while neurology and dermatology exposed major weaknesses across all chatbot models.*

The type of medical question drastically affected accuracy rates. **Obstetrics and gynecology** queries performed best, while **neurology**, **internal medicine**, and **dermatology** scored poorly across the board.

This pattern makes clinical sense: neurology often involves rare conditions requiring specialized diagnostic knowledge, and dermatology depends heavily on visual assessment that text-only chatbots simply cannot perform. Yet these systems will confidently offer skin condition opinions anyway. Researchers also found that prompt length mattered, with **60 to 250 characters** hitting the sweet spot for accurate responses.
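As a rough illustration of the prompt-length finding, the sketch below checks whether a question falls inside the 60 to 250 character range the article reports. The function name, the labels it returns, and the choice to count characters after trimming whitespace are all assumptions made for this example, not details from the study.

```python
# Range reported in the article; how the study measured length is an assumption here.
MIN_CHARS = 60
MAX_CHARS = 250


def prompt_length_status(prompt: str) -> str:
    """Classify a prompt by character count (hypothetical helper)."""
    n = len(prompt.strip())  # ignore leading/trailing whitespace
    if n < MIN_CHARS:
        return "too short"
    if n > MAX_CHARS:
        return "too long"
    return "in range"


if __name__ == "__main__":
    short = "is this rash serious?"
    detailed = (
        "I've had an itchy red rash on my forearm for three days. "
        "Should I see a doctor or wait it out?"
    )
    print(prompt_length_status(short))
    print(prompt_length_status(detailed))
```

A terse query like the rash question lands below the range, which suggests that adding a sentence of context (duration, location, other symptoms) may be worth the effort.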

## Enhanced AI Models Don’t Always Improve Results

*Retrieval-augmented generation failed to boost performance consistently, sometimes making responses worse.*

The study tested [retrieval-augmented generation (RAG)](https://studyfinds.com/best-ai-chatbot-gets-health-questions-wrong-doctors-find/), where chatbots access curated medical knowledge libraries for grounded responses. This approach didn’t consistently improve performance across models.

For Gemini-1.5 Pro and Llama3-8b, physicians actually preferred the standard versions over their RAG-enhanced counterparts. The finding suggests that simply feeding [AI systems](https://www.gadgetreview.com/ai-powered-websites-you-didnt-know-can-supercharge-your-productivity) more medical data doesn’t automatically translate to better clinical advice—a sobering reality for health AI development efforts.
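For readers unfamiliar with RAG, here is a minimal sketch of the general idea: look up the most relevant passage in a curated library, then paste it into the prompt before the question. Everything in it is an assumption for illustration. The library entries are placeholders, the word-overlap scoring is a toy stand-in for a real retriever, and none of it reflects the pipeline the researchers actually used.

```python
import re

# Hypothetical stand-in for a curated medical library; passages are placeholders.
LIBRARY = [
    "Placeholder passage about rash causes and when a rash needs urgent care.",
    "Placeholder passage about migraine triggers and headache warning signs.",
    "Placeholder passage about prenatal vitamins during pregnancy.",
]


def tokens(text: str) -> set[str]:
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, library: list[str], k: int = 1) -> list[str]:
    """Return the k passages sharing the most words with the question."""
    q = tokens(question)
    ranked = sorted(library, key=lambda p: len(q & tokens(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, library: list[str], k: int = 1) -> str:
    """Prepend the retrieved passages to the question as context."""
    context = "\n".join(f"- {p}" for p in retrieve(question, library, k))
    return (
        "Use only the context below to answer.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # The resulting string would be sent to a chatbot in place of the bare question.
    print(build_prompt("Is this rash serious?", LIBRARY))
```

The sketch also hints at why RAG can backfire: the answer is only as good as the retrieved passage, so a weak match can steer the model away from what it would otherwise have said.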

## Doctors See Promise But Warn Against Substitution

*Physicians identify useful applications while emphasizing critical safety limits, especially for mental health scenarios.*

Follow-up interviews revealed physicians see potential in AI for:

- Health literacy support
- Helping patients generate informed questions for doctor visits
- Deciding whether symptoms warrant professional attention

However, they issued strong warnings against overreliance, privacy risks, and particularly dangerous errors in **mental health scenarios** where bad advice could prove catastrophic.

The consensus remains clear: these tools shouldn’t substitute for professional medical advice, especially when wrong answers could affect diagnosis, triage, or treatment decisions.