Researchers evaluate LLMs on multilingual vaccine questions

wpnews.pro

Source: letsdatascience.com · Published: 2026-06-12T03:50Z · Topic: large-language-models

Researchers released a multilingual vaccine benchmark called VaxEval containing 1,886 multiple-choice questions covering 14 vaccines in English, Spanish, and Chinese, drawing from sources including the World Health Organization, CDC, UNICEF, and the American Medical Association. The study evaluated 13 large language models including GPT-4.5, GPT-4o, and Claude 3 Opus, finding that while models answered most questions accurately across all three languages, they made notable errors on vaccination schedules, contraindications, and eligibility criteria. The findings underscore the continued need for medical oversight when using LLMs for vaccine guidance.

News-Medical reports a new study that built a multilingual benchmark, VaxEval, with 1,886 multiple-choice vaccine questions covering 14 vaccines in three United Nations languages: English, Spanish, and Chinese. Question sources included the World Health Organization, CDC, UNICEF, Africa CDC, the American Medical Association, and Immunize.org. The study assessed 13 large language models, including GPT-4.5, GPT-4o, GPT-4, GPT-3.5-Turbo, Claude 3 Opus, and Gemini 1.5 Pro, using zero-shot, few-shot, and chain-of-thought prompting. According to the coverage, models answered most items accurately across the three languages but made notable errors on vaccination schedules, contraindications, and eligibility, which the article frames as a reason medical oversight remains necessary.

What happened

News-Medical reports that researchers released a multilingual vaccine benchmark named VaxEval containing 1,886 multiple-choice questions spanning 14 vaccines in three United Nations languages: English, Spanish, and Chinese. Question material was sourced from the World Health Organization, the Centers for Disease Control and Prevention (CDC), UNICEF, Africa CDC, the American Medical Association (AMA), Immunize.org, and peer-reviewed literature. The article states that the study evaluated 13 large language models: GPT-4.5, GPT-4o, GPT-4, GPT-3.5-Turbo, Claude 3 Opus, Gemini 1.5 Pro, Llama-4 Maverick, DeepSeek-V3, Grok-3, Qwen 2.5, GLM-4, Reka Core, and Yi-Lightning. Each was tested under zero-shot, few-shot, and chain-of-thought prompting.
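To make the three prompting methods concrete, here is a minimal Python sketch of how a multiple-choice item might be rendered as a zero-shot, few-shot, or chain-of-thought prompt. The `MCQItem` class, its field names, and the prompt wording are all hypothetical; the coverage does not describe the study's actual templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQItem:
    """One multiple-choice item. All field names are illustrative assumptions."""
    question: str
    options: tuple[str, ...]
    answer: str    # gold option letter, e.g. "B"
    language: str  # e.g. "en", "es", or "zh"


def format_item(item: MCQItem) -> str:
    """Render the question followed by lettered options."""
    letters = "ABCDEFGH"
    lines = [item.question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(item.options)]
    return "\n".join(lines)


def zero_shot(item: MCQItem) -> str:
    """Ask the question directly, with no solved examples."""
    return f"{format_item(item)}\nAnswer:"


def few_shot(item: MCQItem, examples: list[MCQItem]) -> str:
    """Prepend solved examples, then ask the target question."""
    solved = [f"{format_item(ex)}\nAnswer: {ex.answer}" for ex in examples]
    return "\n\n".join(solved + [zero_shot(item)])


def chain_of_thought(item: MCQItem) -> str:
    """Ask the model to reason before committing to an option letter."""
    return (f"{format_item(item)}\n"
            "Think step by step, then finish with 'Answer: <letter>'.")
```

Keeping the item rendering in one function means the three styles differ only in the framing around the question, which makes later comparisons across styles easier to interpret.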

Technical details

Editorial analysis: The reported benchmark format, being multiple-choice and drawn from authoritative public-health sources, emphasizes factual recall and rule-based reasoning tasks such as schedules, contraindications, and eligibility criteria. Industry-pattern observations: Evaluations that mix factual recall with clinical-rule questions tend to differentiate models that memorize guidelines from those that can apply multi-step, threshold-based logic consistently across languages.

Context and significance

Editorial analysis: For practitioners, the finding that models "answered most items accurately" across English, Spanish, and Chinese, as reported by News-Medical, suggests substantial cross-lingual factual capability in current LLMs. However, the same coverage flags systematic weakness on clinical rules like timing, contraindications, and eligibility. Industry observers have repeatedly noted that such rule-based errors are higher-risk than isolated factual mistakes because they can materially affect care decisions when LLM outputs are used without verification.
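One way a team could check whether rule-type questions lag behind overall accuracy is to break results down by language and question category. The sketch below assumes each result record carries `language`, `category`, `predicted`, and `gold` fields; those names, the category labels, and the sample records are hypothetical and are not data from the study.

```python
from collections import defaultdict


def accuracy_by_group(records, keys=("language", "category")):
    """Return {group: fraction correct} for each combination of `keys`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += 1
        correct[group] += int(rec["predicted"] == rec["gold"])
    return {g: correct[g] / totals[g] for g in totals}


# Synthetic records for illustration only; not results from the study.
records = [
    {"language": "en", "category": "schedule", "predicted": "A", "gold": "A"},
    {"language": "en", "category": "schedule", "predicted": "B", "gold": "C"},
    {"language": "es", "category": "eligibility", "predicted": "D", "gold": "D"},
    {"language": "zh", "category": "contraindication", "predicted": "A", "gold": "B"},
]

for group, acc in sorted(accuracy_by_group(records).items()):
    print(group, f"{acc:.2f}")
```

Passing `keys=("category",)` collapses the languages, which gives a quick view of whether a weakness is tied to the question type rather than to any one language.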

What to watch

Editorial analysis: Observers should monitor whether future benchmarks release detailed per-question error types, per-language breakdowns, and calibration data. For tool builders and clinical informatics teams, the most relevant indicators will be model behavior on conditional logic tasks, consistency across prompting styles, and availability of provenance traces or references that tie assertions back to source guidance.
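Two of these indicators, consistency across prompting styles and calibration, can be sketched in a few lines of Python. The function names and input shapes below are assumptions for illustration: answers are lists aligned by question, and each confidence is a probability in [0, 1] attached to the model's chosen option. The calibration measure is a standard equal-width-bin expected calibration error, not a metric the coverage attributes to the study.

```python
def prompt_consistency(answers_by_style):
    """Fraction of questions where every prompting style gave the same answer.

    `answers_by_style` maps a style name to a list of answers aligned by question.
    """
    per_question = list(zip(*answers_by_style.values()))
    if not per_question:
        return 0.0
    agree = sum(1 for answers in per_question if len(set(answers)) == 1)
    return agree / len(per_question)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            acc = sum(ok for _, ok in members) / len(members)
            ece += len(members) / n * abs(acc - avg_conf)
    return ece
```

A low consistency score on a question category would suggest that the model's answer depends on how it is asked, which is a separate concern from whether any single answer is right.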

Reported limitations

The News-Medical article notes that the study used multiple-choice questions. It does not quote the authors on publication venue or provide raw accuracy numbers. The coverage frames the conclusion as reinforcing the need for medical oversight when LLMs handle vaccine guidance.

Scoring Rationale

The story presents a substantive benchmark that tests LLMs on clinically relevant vaccine knowledge across languages. It is notable for safety and deployment implications but not a paradigm shift in modeling, so it ranks as a notable research/practical finding.

source & further reading

Original article: letsdatascience.com

Read original on letsdatascience.com → letsdatascience.com/news/researchers-evaluate-ll…
