Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models


Source: arxiv.org. Published: 2026-06-26T04:00Z. Topic: large-language-models.

Researchers introduced Know2Guess, a contamination-aware multi-zone benchmark with 1,200 items across five domains to evaluate large language models' ability to distinguish answerable knowledge from abstention-expected unknowns. Tests on FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models showed that Qwen2.5-3B-Instruct achieved the best reliability, but answer-expected zones remained difficult and calibration poor. The benchmark provides a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct dimensions of LLM reliability.

arXiv:2606.26101v1. Abstract: Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. We present a contamination-aware, multi-zone benchmark for measuring the transition from answerable knowledge to abstention-expected unknowns under frozen build-time labels. The benchmark contains 1,200 items across five domains, explicit abstention expectations, contamination-risk metadata, and dual parsing with an official strict parser plus a normalized robustness parser. We evaluate FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models under locked answer-or-abstain prompts, answer-only controls, and prompt-template variants. The benchmark is not solved by generic non-answer behavior: FLAN baselines remain weak on productive abstention, while stronger instruction-tuned models expose a selective but incomplete transition from answering to abstaining. Qwen2.5-3B-Instruct achieves the best overall reliability, but answer-expected zones remain difficult, calibration remains poor, and benign-item refusal persists. Prompt and parser robustness analyses preserve the main ranking and qualitative conclusions. The benchmark therefore provides a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct but interacting dimensions of LLM reliability. The dataset is publicly available at https://github.com/renweimeng/Know2Guess-A-Contamination-Aware-Multi-Zone-Benchmark.
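To make the dual-parsing idea concrete, here is a minimal sketch of how a strict parser and a normalized robustness parser might score the same model outputs differently. It is not the benchmark's actual code: the output format (`ANSWER: ...` / `ABSTAIN`), the abstention phrases, the zone names, and the item fields (`zone`, `expect_abstain`, `gold`) are all assumptions made for illustration.

```python
import re
import string
from collections import defaultdict

# Hypothetical output convention; the real benchmark's format may differ.
ABSTAIN_TOKEN = "ABSTAIN"
ABSTAIN_PHRASES = ("abstain", "i dont know", "i do not know", "unknown")


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def strict_parse(output):
    """Accept only outputs that exactly follow the assumed format."""
    text = output.strip()
    if text == ABSTAIN_TOKEN:
        return ("abstain", None)
    match = re.fullmatch(r"ANSWER: (.+)", text)
    if match:
        return ("answer", match.group(1))
    return ("invalid", None)


def normalized_parse(output):
    """Tolerate casing, punctuation, and common abstention phrasings."""
    text = normalize(output)
    if not text:
        return ("invalid", None)
    if any(text.startswith(phrase) for phrase in ABSTAIN_PHRASES):
        return ("abstain", None)
    if text.startswith("answer "):
        text = text[len("answer "):]
    return ("answer", text)


def score(items, outputs, parser):
    """Per-zone fraction of outputs that show the expected behavior."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item, output in zip(items, outputs):
        kind, answer = parser(output)
        if item["expect_abstain"]:
            ok = kind == "abstain"
        else:
            ok = kind == "answer" and normalize(answer) == normalize(item["gold"])
        totals[item["zone"]] += 1
        hits[item["zone"]] += int(ok)
    return {zone: hits[zone] / totals[zone] for zone in totals}


# Toy items with made-up zone names and labels, for illustration only.
items = [
    {"zone": "answerable", "expect_abstain": False, "gold": "Paris"},
    {"zone": "answerable", "expect_abstain": False, "gold": "Berlin"},
    {"zone": "unknown", "expect_abstain": True, "gold": None},
    {"zone": "unknown", "expect_abstain": True, "gold": None},
]
outputs = ["ANSWER: Paris", "berlin.", "ABSTAIN", "I don't know."]

print(score(items, outputs, strict_parse))      # {'answerable': 0.5, 'unknown': 0.5}
print(score(items, outputs, normalized_parse))  # {'answerable': 1.0, 'unknown': 1.0}
```

In this toy run the strict parser penalizes formatting deviations, while the normalized parser credits the same underlying behavior. Comparing the two scores is one way to check whether a result reflects what the model knows or merely how it formats its reply, which is the kind of parser-robustness check the abstract describes.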

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.26101
