{"slug": "t2d-bench-evidence-gated-evaluation-of-llm-outputs-for-type-2-diabetes-using-a", "title": "T2D-Bench: Evidence-Gated Evaluation of LLM Outputs for Type 2 Diabetes Using a Multi-Layer Clinical-Lifestyle Knowledge Graph", "summary": "Researchers introduced T2D-Bench, a benchmark for evaluating LLM outputs on type 2 diabetes using a multi-layer clinical-lifestyle knowledge graph. Testing showed GPT-4o-mini and GPT-4o failed evidence-path checks in 35% and 33% of cases, respectively, highlighting unsupported omissions that the framework can detect and correct.", "body_md": "arXiv:2606.24145v1\nAbstract: Large language models (LLMs) can produce clinically fluent recommendations for type 2 diabetes while failing to satisfy guideline constraints or explicitly justify lifestyle-related glycemic claims. We present T2D-Bench, a reproducible benchmark and evidence-gated evaluation framework for testing whether LLM outputs satisfy explicit, graph-checkable evidence requirements. T2D-Bench is built on a multi-layer clinical-lifestyle knowledge graph that combines a biomedical spine (UMLS, DrugBank, SIDER), computable ADA Standards of Care rules, and lifestyle knowledge connected through a mechanistic bridge to glycemic laboratory effects. Across 100 structured vignettes spanning diagnosis, medication safety, and adversarial lifestyle conflicts, baseline outputs failed benchmark-defined evidence-path checks in 35% of cases for GPT-4o-mini and 33% for GPT-4o. The evidence gate detects unsupported omissions and uses constrained revision to bring outputs into verifier-level compliance with benchmark-defined evidence requirements. These results show that computable evidence constraints can make unsupported clinical omissions explicit, measurable, and correctable in diabetes-focused LLM outputs.", "url": "https://wpnews.pro/news/t2d-bench-evidence-gated-evaluation-of-llm-outputs-for-type-2-diabetes-using-a", "canonical_source": "https://arxiv.org/abs/2606.24145", "published_at": "2026-06-24 04:00:00+00:00", "updated_at": "2026-06-24 04:30:53.412675+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-research"], "entities": ["T2D-Bench", "GPT-4o-mini", "GPT-4o", "UMLS", "DrugBank", "SIDER", "ADA"], "alternates": {"html": "https://wpnews.pro/news/t2d-bench-evidence-gated-evaluation-of-llm-outputs-for-type-2-diabetes-using-a", "markdown": "https://wpnews.pro/news/t2d-bench-evidence-gated-evaluation-of-llm-outputs-for-type-2-diabetes-using-a.md", "text": "https://wpnews.pro/news/t2d-bench-evidence-gated-evaluation-of-llm-outputs-for-type-2-diabetes-using-a.txt", "jsonld": "https://wpnews.pro/news/t2d-bench-evidence-gated-evaluation-of-llm-outputs-for-type-2-diabetes-using-a.jsonld"}}