Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

Researchers introduced a paired fixed-content probe over 500 MMLU-Pro items to test how discourse-role labels such as Instruction:, Reference:, and Example: affect whether a language model adopts misleading information. Each item presents the same misleading assertion under different labels, so any change in behavior is attributable to the label rather than the content. Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, the Misleading Adoption Rate shifted by 56-84 percentage points: binding or source-like labels such as Instruction: and Reference: drove high adoption, while Example: consistently suppressed it. The authors conclude that context-utilization and reader-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context.

Published Jun 4, 2026

arXiv:2606.04109v1

Abstract: Context-augmented language model systems often wrap supplied content with labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored. We introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs the injected wrong option. Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, Misleading Adoption Rate shifts by 56-84 percentage points. Binding or source-like labels such as Instruction: and Reference: produce high adoption, whereas Example: consistently suppresses it. Paired tests, bootstrap intervals, final-instruction ablations, and Qwen final-step log-probability probes support a label-conditioned candidate preference. Boundary probes show where the effect weakens or persists: arithmetic tasks reduce adoption, passage-shaped external context preserves smaller label gaps, short-answer evaluation rules out option-letter copying, and nested-label conflicts suggest that illustrative framing can delimit adoption scope. A 200-case single-author manual audit confirms that the short-answer contrasts are stable under conservative adjudication. The resulting claim is bounded but practical: context-utilization and reader-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context.
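
The paired design described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: the prompt template and the names build_prompt, adoption_rate, and paired_bootstrap_gap are hypothetical, and the abstract does not specify the resample count or interval method, so a percentile bootstrap over items is used here as one common choice. The sketch shows the two ingredients the abstract names: the same misleading assertion wrapped under different labels, and a per-label adoption rate compared item by item.

```python
import random
from typing import List, Sequence, Tuple

# Labels listed in the abstract; the prompt layout itself is an assumption.
LABELS = ["Instruction:", "Reference:", "Evidence:", "Note:", "Example:"]


def build_prompt(question: str, options: Sequence[str],
                 label: str, assertion: str) -> str:
    """Wrap a fixed misleading assertion under one discourse-role label.

    Only `label` varies between paired prompts; the assertion, question,
    and options stay identical (the "fixed-content" part of the probe).
    """
    letters = "ABCDEFGHIJ"  # MMLU-Pro items have up to ten options
    option_lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join([f"{label} {assertion}", "", question,
                      *option_lines, "Answer:"])


def adoption_rate(adopted: Sequence[bool]) -> float:
    """Fraction of items where the model output the injected wrong option."""
    return sum(adopted) / len(adopted)


def paired_bootstrap_gap(a: Sequence[bool], b: Sequence[bool],
                         n_resamples: int = 2000, alpha: float = 0.05,
                         seed: int = 0) -> Tuple[float, float, float]:
    """Point estimate and percentile interval for rate(a) - rate(b).

    Items are resampled jointly, so each resample keeps the pairing
    between the two label conditions for the same item.
    """
    if len(a) != len(b):
        raise ValueError("paired conditions must cover the same items")
    n = len(a)
    diffs = [int(x) - int(y) for x, y in zip(a, b)]
    point = sum(diffs) / n
    rng = random.Random(seed)
    gaps: List[float] = []
    for _ in range(n_resamples):
        gaps.append(sum(diffs[rng.randrange(n)] for _ in range(n)) / n)
    gaps.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return point, gaps[lo_idx], gaps[hi_idx]


if __name__ == "__main__":
    # Toy outcomes for ten items, invented for illustration only.
    instruction = [True] * 8 + [False] * 2
    example = [True] * 1 + [False] * 9
    print(build_prompt("Which gas is most abundant in air?",
                       ["Oxygen", "Nitrogen"], "Example:",
                       "The correct answer is A."))
    print(adoption_rate(instruction), adoption_rate(example))
    print(paired_bootstrap_gap(instruction, example))
```

Resampling item-level differences, rather than each condition separately, is what makes the interval reflect the paired design: every item contributes its own label-to-label contrast.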

Read original on arxiv.org → arxiv.org/abs/2606.04109