[ARTICLE · art-24741] src=letsdatascience.com pub= topic=large-language-models verified=true sentiment=· neutral

Researchers evaluate LLMs on multilingual vaccine questions

Researchers released a multilingual vaccine benchmark called VaxEval containing 1,886 multiple-choice questions covering 14 vaccines in English, Spanish, and Chinese, drawing from sources including the World Health Organization, CDC, UNICEF, and the American Medical Association. The study evaluated 13 large language models including GPT-4.5, GPT-4o, and Claude 3 Opus, finding that while models answered most questions accurately across all three languages, they made notable errors on vaccination schedules, contraindications, and eligibility criteria. The findings underscore the continued need for medical oversight when using LLMs for vaccine guidance.

3 min read · Published Jun 12, 2026

News-Medical reports on a new study that built VaxEval, a multilingual benchmark of 1,886 multiple-choice vaccine questions covering 14 vaccines in three official United Nations languages: English, Spanish, and Chinese. According to the report, question sources included the World Health Organization, CDC, UNICEF, Africa CDC, the American Medical Association, and Immunize.org. The study assessed 13 large language models, including GPT-4.5, GPT-4o, GPT-4, GPT-3.5-Turbo, Claude 3 Opus, and Gemini 1.5 Pro, using zero-shot, few-shot, and chain-of-thought prompting. The coverage states that models answered most items accurately across the three languages but made notable errors on vaccination schedules, contraindications, and eligibility, which the article frames as a reason medical oversight remains necessary.

What happened

According to News-Medical, the researchers released VaxEval as a benchmark of 1,886 multiple-choice questions spanning 14 vaccines in English, Spanish, and Chinese. Question material was reportedly drawn from the World Health Organization, the Centers for Disease Control and Prevention (CDC), UNICEF, Africa CDC, the American Medical Association (AMA), Immunize.org, and peer-reviewed literature. The article states that the study evaluated 13 large language models, including GPT-4.5, GPT-4o, GPT-4, GPT-3.5-Turbo, Claude 3 Opus, Gemini 1.5 Pro, Llama-4 Maverick, DeepSeek-V3, Grok-3, Qwen 2.5, GLM-4, Reka Core, and Yi-Lightning, under zero-shot, few-shot, and chain-of-thought prompting.
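To make the three prompting methods concrete, the sketch below shows one way a multiple-choice item could be rendered as a zero-shot, few-shot, or chain-of-thought prompt. It is illustrative only: the `Item` schema, the `build_prompt` helper, the prompt wording, and the placeholder questions are all assumptions, not the format the VaxEval authors used.

```python
from dataclasses import dataclass
from typing import List, Sequence

LETTERS = "ABCDEFGH"


@dataclass
class Item:
    """One multiple-choice item. Hypothetical schema, not VaxEval's own."""
    question: str
    options: List[str]
    answer: str = ""  # gold option letter; only needed for few-shot demos


def render(item: Item) -> str:
    """Format the question followed by lettered options."""
    lines = [item.question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(item.options)]
    return "\n".join(lines)


def build_prompt(item: Item, style: str, demos: Sequence[Item] = ()) -> str:
    """Return the prompt text for one of the three prompting styles."""
    if style == "zero_shot":
        # The question alone, with no worked examples.
        return render(item) + "\nAnswer:"
    if style == "few_shot":
        # Solved demonstration items precede the target question.
        shots = [render(d) + f"\nAnswer: {d.answer}" for d in demos]
        return "\n\n".join(shots + [render(item) + "\nAnswer:"])
    if style == "chain_of_thought":
        # Ask for intermediate reasoning before the final letter.
        return (render(item)
                + "\nThink step by step, then give the final answer as a single letter.")
    raise ValueError(f"unknown prompting style: {style}")


# Placeholder items; the text is invented and carries no medical content.
demo = Item("Placeholder demo question?", ["option one", "option two"], "B")
target = Item("Placeholder target question?",
              ["option one", "option two", "option three"])

for style in ("zero_shot", "few_shot", "chain_of_thought"):
    print(f"--- {style} ---")
    print(build_prompt(target, style, demos=[demo]))
```

Only the prompt text changes between styles; the item and its gold answer stay the same, which is what allows accuracy to be compared across prompting methods.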

Technical details

Editorial analysis: Because the reported benchmark is multiple-choice and drawn from authoritative public-health sources, it emphasizes factual recall and rule-based reasoning about schedules, contraindications, and eligibility criteria. Industry-pattern observation: evaluations that mix factual recall with clinical-rule questions tend to separate models that memorize guidelines from those that can apply multi-step, threshold-based logic consistently across languages.

Context and significance

Editorial analysis: For practitioners, the finding that models "answered most items accurately" across English, Spanish, and Chinese, as reported by News-Medical, suggests substantial cross-lingual factual capability in current LLMs. However, the same coverage flags systematic weakness on clinical rules like timing, contraindications, and eligibility. Industry observers have repeatedly noted that such rule-based errors are higher-risk than isolated factual mistakes because they can materially affect care decisions when LLM outputs are used without verification.

What to watch

Editorial analysis: Observers should monitor whether future benchmarks release detailed per-question error types, per-language breakdowns, and calibration data. For tool builders and clinical informatics teams, the most relevant indicators will be model behavior on conditional logic tasks, consistency across prompting styles, and availability of provenance traces or references that tie assertions back to source guidance.
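As a rough illustration of the breakdowns mentioned above, the sketch below computes accuracy per language or per question category, plus a simple measure of consistency across prompting styles. The record fields (`item_id`, `language`, `category`, `style`, `predicted`, `gold`) and the toy values are assumptions made for the example; they are not data or a schema from the study.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Accuracy grouped by one field, e.g. 'language' or 'category'."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        correct[r[key]] += int(r["predicted"] == r["gold"])
    return {k: correct[k] / totals[k] for k in totals}


def prompt_consistency(records):
    """Share of (item, language) pairs answered identically by every style."""
    answers = defaultdict(set)
    for r in records:
        answers[(r["item_id"], r["language"])].add(r["predicted"])
    return sum(len(a) == 1 for a in answers.values()) / len(answers)


# Toy records invented for illustration; not results from the study.
records = [
    {"item_id": 1, "language": "en", "category": "schedule",
     "style": "zero_shot", "predicted": "A", "gold": "A"},
    {"item_id": 1, "language": "en", "category": "schedule",
     "style": "chain_of_thought", "predicted": "B", "gold": "A"},
    {"item_id": 1, "language": "es", "category": "schedule",
     "style": "zero_shot", "predicted": "A", "gold": "A"},
    {"item_id": 1, "language": "es", "category": "schedule",
     "style": "chain_of_thought", "predicted": "A", "gold": "A"},
]

print(accuracy_by(records, "language"))
print(accuracy_by(records, "category"))
print(prompt_consistency(records))
```

Grouping by `category` instead of `language` would surface the kind of weakness the coverage describes on schedules, contraindications, and eligibility, provided a benchmark publishes per-question labels.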

Reported limitations

News-Medical notes that the study used multiple-choice questions. The article itself does not quote the authors on the publication venue or provide raw accuracy numbers. The coverage frames the conclusion as reinforcing the need for medical oversight when LLMs handle vaccine guidance.

Scoring Rationale

The story presents a substantive benchmark that tests LLMs on clinically relevant vaccine knowledge across languages. It matters for safety and deployment but is not a paradigm shift in modeling, so it ranks as a notable research and practical finding.
