{"slug": "know2guess-a-contamination-aware-multi-zone-benchmark-for-knowledge-boundary-in", "title": "Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models", "summary": "Researchers introduced Know2Guess, a contamination-aware multi-zone benchmark with 1,200 items across five domains to evaluate large language models' ability to distinguish answerable knowledge from abstention-expected unknowns. Tests on FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models showed that Qwen2.5-3B-Instruct achieved the best reliability, but answer-expected zones remained difficult and calibration poor. The benchmark provides a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct dimensions of LLM reliability.", "body_md": "arXiv:2606.26101v1\nAbstract: Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. We present a contamination-aware, multi-zone benchmark for measuring the transition from answerable knowledge to abstention-expected unknowns under frozen build-time labels. The benchmark contains 1,200 items across five domains, explicit abstention expectations, contamination-risk metadata, and dual parsing with an official strict parser plus a normalized robustness parser. We evaluate FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models under locked answer-or-abstain prompts, answer-only controls, and prompt-template variants. The benchmark is not solved by generic non-answer behavior: FLAN baselines remain weak on productive abstention, while stronger instruction-tuned models expose a selective but incomplete transition from answering to abstaining. 
Qwen2.5-3B-Instruct achieves the best overall reliability, but answer-expected zones remain difficult, calibration remains poor, and benign-item refusal persists. Prompt and parser robustness analyses preserve the main ranking and qualitative conclusions. The benchmark therefore provides a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct but interacting dimensions of LLM reliability. The dataset is publicly available at https://github.com/renweimeng/Know2Guess-A-Contamination-Aware-Multi-Zone-Benchmark.", "url": "https://wpnews.pro/news/know2guess-a-contamination-aware-multi-zone-benchmark-for-knowledge-boundary-in", "canonical_source": "https://arxiv.org/abs/2606.26101", "published_at": "2026-06-26 04:00:00+00:00", "updated_at": "2026-06-26 04:04:20.125721+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-safety", "ai-ethics"], "entities": ["Know2Guess", "FLAN-T5", "Qwen2.5-Instruct", "Llama-3-Instruct", "Qwen2.5-3B-Instruct", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/know2guess-a-contamination-aware-multi-zone-benchmark-for-knowledge-boundary-in", "markdown": "https://wpnews.pro/news/know2guess-a-contamination-aware-multi-zone-benchmark-for-knowledge-boundary-in.md", "text": "https://wpnews.pro/news/know2guess-a-contamination-aware-multi-zone-benchmark-for-knowledge-boundary-in.txt", "jsonld": "https://wpnews.pro/news/know2guess-a-contamination-aware-multi-zone-benchmark-for-knowledge-boundary-in.jsonld"}}