{"slug": "possible-or-definite-a-benchmark-for-evaluating-diagnostic-uncertainty-in-text", "title": "Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text", "summary": "A new arXiv study finds that large language models (LLMs) poorly preserve diagnostic uncertainty in clinical text, often keeping the original uncertainty cue in phrases like \"possible pneumonia\" less than half the time, which could misguide medical decisions. The study, which evaluates three LLMs on a new benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels, reveals a failure mode not captured by standard metrics and raises safety concerns for LLM deployment in clinical workflows.", "body_md": "arXiv:2606.18471v1 Announce Type: new\nAbstract: Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic uncertainty remains underexplored. In clinical practice, phrases such as \"possible pneumonia\" communicate the strength of available evidence and directly guide decisions about follow-up testing and treatment. Altering these uncertainty expressions can change the clinical meaning entirely. In this paper, we systematically evaluated this problem in two steps. First, we constructed a benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. Second, we evaluated three LLMs on this benchmark. Our results show that (1) LLMs preserve the original uncertainty cues poorly, often less than half the time; (2) LLMs struggle with nuanced distinctions between adjacent levels. This work reveals a failure mode not captured by standard evaluation metrics and provides implications for the safe deployment of LLMs in clinical workflows.", "url": "https://wpnews.pro/news/possible-or-definite-a-benchmark-for-evaluating-diagnostic-uncertainty-in-text", "canonical_source": "https://arxiv.org/abs/2606.18471", "published_at": "2026-06-18 04:00:00+00:00", "updated_at": "2026-06-18 04:25:02.082532+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "natural-language-processing"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/possible-or-definite-a-benchmark-for-evaluating-diagnostic-uncertainty-in-text", "markdown": "https://wpnews.pro/news/possible-or-definite-a-benchmark-for-evaluating-diagnostic-uncertainty-in-text.md", "text": "https://wpnews.pro/news/possible-or-definite-a-benchmark-for-evaluating-diagnostic-uncertainty-in-text.txt", "jsonld": "https://wpnews.pro/news/possible-or-definite-a-benchmark-for-evaluating-diagnostic-uncertainty-in-text.jsonld"}}