Disentangling Language Roles in Multilingual LLM Task Execution

wpnews.pro

Source: arxiv.org · Published: 2026-05-28T04:00Z · Topic: large-language-models

Researchers introduced MTM-Bench, a controlled benchmark that isolates three distinct language roles—instruction, content, and response—across English, Spanish, and Chinese to evaluate multilingual LLM task execution. Testing 20 models across all 27 possible language triplets, the study found that degradation is primarily driven by the response-language role, with a single response-slot mismatch accounting for most performance loss. The findings challenge the assumption that mismatch count alone predicts difficulty, revealing that task families fail through distinct channels and that semantic correctness does not ensure reliable multilingual execution.

Published May 28, 2026 · 1 min read

arXiv:2605.27649v1. Abstract: Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM-Bench, a controlled benchmark for language-conditioned task execution in which each instance is defined by a triplet $(L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})$. Across English, Spanish, and Chinese, MTM-Bench enumerates all 27 triplets and contains 2,430 instances per model across semantic reversal, final-state extraction, and language purity with update realization. We evaluate 20 frontier and open-weight LLMs using decomposed metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit. The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response-language role is the dominant axis of variation, and a single response-slot mismatch accounts for most degradation. The response-only and full-mismatch comparison suggests that mismatch count is not a monotonic predictor of difficulty, with model-level ordering varying across systems. Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.
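The fully crossed design described in the abstract can be illustrated with a short sketch that enumerates every (instruction, content, response) triplet over three languages and groups them by which role is out of step. This is not the authors' code: the language codes, the function name `mismatch_pattern`, and the pattern labels are assumptions made for illustration, and the abstract does not state how MTM-Bench itself defines mismatch count.

```python
from collections import Counter
from itertools import product

# Hypothetical short codes for English, Spanish, and Chinese.
LANGS = ("en", "es", "zh")


def mismatch_pattern(instr: str, content: str, resp: str) -> str:
    """Label a triplet by which language role, if any, is the odd one out.

    The labels are illustrative assumptions, not the benchmark's terminology.
    """
    distinct = len({instr, content, resp})
    if distinct == 1:
        return "all_match"
    if distinct == 3:
        return "full_mismatch"
    # Exactly two roles share a language; name the role that differs.
    if instr == content:
        return "response_only"
    if instr == resp:
        return "content_only"
    return "instruction_only"


# Fully crossed design: every language in every role, 3 * 3 * 3 = 27 triplets.
TRIPLETS = list(product(LANGS, repeat=3))
PATTERN_COUNTS = Counter(mismatch_pattern(*t) for t in TRIPLETS)

if __name__ == "__main__":
    print(len(TRIPLETS), "triplets")
    for pattern, count in sorted(PATTERN_COUNTS.items()):
        print(pattern, count)
```

Under this grouping, the "response-only" cell (instruction and content agree, response differs) and the "full-mismatch" cell (all three differ) are the two conditions the abstract compares. If the 2,430 instances per model were spread evenly over the 27 triplets, that would be 90 per triplet, but the abstract does not state the distribution.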

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2605.27649
