
T2D-Bench: Evidence-Gated Evaluation of LLM Outputs for Type 2 Diabetes Using a Multi-Layer Clinical-Lifestyle Knowledge Graph

Researchers introduced T2D-Bench, a benchmark that evaluates LLM outputs on type 2 diabetes against a multi-layer clinical-lifestyle knowledge graph. Across 100 structured vignettes, baseline outputs from GPT-4o-mini and GPT-4o failed benchmark-defined evidence-path checks in 35% and 33% of cases, respectively. The framework's evidence gate detects these unsupported omissions and corrects them through constrained revision.

Published Jun 24, 2026

arXiv:2606.24145v1

Abstract: Large language models (LLMs) can produce clinically fluent recommendations for type 2 diabetes while failing to satisfy guideline constraints or explicitly justify lifestyle-related glycemic claims. We present T2D-Bench, a reproducible benchmark and evidence-gated evaluation framework for testing whether LLM outputs satisfy explicit, graph-checkable evidence requirements. T2D-Bench is built on a multi-layer clinical-lifestyle knowledge graph that combines a biomedical spine (UMLS, DrugBank, SIDER), computable ADA Standards of Care rules, and lifestyle knowledge connected through a mechanistic bridge to glycemic laboratory effects. Across 100 structured vignettes spanning diagnosis, medication safety, and adversarial lifestyle conflicts, baseline outputs failed benchmark-defined evidence-path checks in 35% of cases for GPT-4o-mini and 33% for GPT-4o. The evidence gate detects unsupported omissions and uses constrained revision to bring outputs into verifier-level compliance with benchmark-defined evidence requirements. These results show that computable evidence constraints can make unsupported clinical omissions explicit, measurable, and correctable in diabetes-focused LLM outputs.
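The abstract does not specify how the evidence gate is implemented, so the following is only a minimal sketch of what a graph-checkable evidence-path check might look like. Everything in it is an assumption made for illustration: the toy graph `EDGES`, its node names, and the functions `find_path`, `evidence_gate`, and `revise` are hypothetical and are not taken from T2D-Bench. The idea shown is that a claim linking a lifestyle factor or drug to a glycemic lab effect passes only if a path exists in the graph and the output mentions every node on that path. Any node left out is reported as an unsupported omission.

```python
from collections import deque

# Hypothetical toy graph; the edges are illustrative and NOT from T2D-Bench.
EDGES = {
    "brisk_walking": ["insulin_sensitivity"],
    "insulin_sensitivity": ["hba1c"],
    "metformin": ["hepatic_glucose_output"],
    "hepatic_glucose_output": ["hba1c"],
}


def find_path(graph, start, goal):
    """Breadth-first search: return the node list from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def evidence_gate(graph, claim, cited):
    """Check one claim (source, target) against the nodes an output cites.

    Returns (status, missing), where status is "supported", "omission",
    or "no_path", and missing lists path nodes the output did not mention.
    """
    path = find_path(graph, claim[0], claim[1])
    if path is None:
        return "no_path", []
    missing = [node for node in path if node not in cited]
    return ("supported" if not missing else "omission"), missing


def revise(cited, missing):
    """Stand-in for constrained revision (assumption): add the missing nodes.

    A real system would presumably ask the model to rewrite its output;
    here the cited set is simply extended so the re-check can be shown.
    """
    return set(cited) | set(missing)


claim = ("brisk_walking", "hba1c")
cited = {"brisk_walking", "hba1c"}  # output skips the mechanistic bridge
status, missing = evidence_gate(EDGES, claim, cited)
print(status, missing)
status_after, _ = evidence_gate(EDGES, claim, revise(cited, missing))
print(status_after)
```

In this sketch, an output that asserts a glycemic effect without naming the intermediate mechanism is flagged, and the re-check after revision passes. That mirrors the detect-then-revise loop the abstract describes, without claiming to reproduce it.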
