ByteDance's "iLLaDA" is a diffusion language model that keeps up with Qwen2.5


Source: the-decoder.com · Published: 2026-06-27T07:48Z · Topic: large-language-models


ByteDance and Renmin University researchers released iLLaDA, an 8B diffusion language model that matches Qwen2.5 on base benchmarks but lags after fine-tuning. The model, trained from scratch on 12 trillion tokens, outperforms prior diffusion models like LLaDA and Dream 7B, though its instruct version trails Qwen2.5 due to lack of reinforcement learning alignment.

4 min read · published Jun 27, 2026


Researchers from Renmin University and ByteDance have released iLLaDA, an 8B language model that works differently from ChatGPT. It matches Qwen2.5 at the base level but falls behind after fine-tuning.

Nearly all well-known AI language models, such as GPT, Claude, or Qwen, generate text autoregressively: token by token, left to right, with each new token depending only on the ones before it.

Diffusion language models take a different approach. They start with a sequence of placeholders, called masked tokens, and refine them across multiple passes in parallel. It's similar to how image models shape a picture from noise. Every position can attend to every other position at the same time, making the process bidirectional.
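The refinement loop described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_predictor` is a hypothetical stand-in for a real bidirectional network (it simply looks up the answer and invents a confidence score), and the confidence-based unmasking schedule is one common heuristic for masked diffusion decoding, not necessarily the sampler iLLaDA uses.

```python
import math
import random

MASK = "<mask>"


def toy_predictor(seq, target, rng):
    """Hypothetical stand-in for a bidirectional model.

    For every masked position it returns (token, confidence). The token is
    read from `target` (a real model would predict it), and the confidence
    is higher when the left/right neighbours are already revealed.
    """
    preds = {}
    for i, tok in enumerate(seq):
        if tok != MASK:
            continue
        neighbours = [seq[j] for j in (i - 1, i + 1) if 0 <= j < len(seq)]
        revealed = sum(t != MASK for t in neighbours)
        preds[i] = (target[i], 0.5 + 0.2 * revealed + 0.1 * rng.random())
    return preds


def diffusion_decode(target, steps, seed=0):
    """Start fully masked, then reveal the most confident positions per pass."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    history = [list(seq)]
    for step in range(steps):
        preds = toy_predictor(seq, target, rng)
        if not preds:
            break
        # Spread the remaining masked positions evenly over the remaining passes.
        k = math.ceil(len(preds) / (steps - step))
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = preds[i][0]
        history.append(list(seq))
    return seq, history


if __name__ == "__main__":
    words = "diffusion models fill in all masked tokens together".split()
    _, passes = diffusion_decode(words, steps=4)
    for p in passes:
        print(" ".join(p))
```

In this sketch an eight-token sequence is completed in four passes, whereas a left-to-right decoder would need eight sequential steps. The positions filled in each pass are chosen by confidence rather than by order, which is the practical difference from autoregressive generation.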

iLLaDA is part of a broader movement that includes Google. In June 2026, Google DeepMind released DiffusionGemma. That model generates text about four times faster via diffusion but scores worse on MMLU and coding benchmarks than the similarly sized autoregressive Gemma 4. Google recommends it for low-latency use cases, not quality-critical production.

The two models differ in design. DiffusionGemma is built on the Gemma 4 backbone, a 25-billion-parameter mixture-of-experts model, and swaps only the generation method to prioritize speed. iLLaDA, short for "improved LLaDA," goes the other way. It's a dense 8B model trained from scratch, focused on quality.

The question behind all of this is whether a diffusion model built from the ground up can actually keep up with autoregressive models. A direct numerical comparison between the two is tough, though. Google uses partly different and harder benchmark variants, and DiffusionGemma plays in a different weight class.

What iLLaDA can do

The team pretrained iLLaDA on 12 trillion tokens, up from 2.3 trillion for its predecessor LLaDA, and fine-tuned it for twelve epochs. According to the paper, iLLaDA-Base improves sharply over LLaDA, jumping 21.6 points on the reasoning test BBH, for example. On average it hits 63.9 points, edging just past the autoregressive Qwen2.5 7B at 63.3.

Model	iLLaDA 8B	LLaDA 8B	Dream 7B	Qwen2.5 7B
Type	Diffusion	Diffusion	Diffusion	AR
Training tokens	12T	2.3T	18T + 0.6T	18T
General Tasks
MMLU	74.8	65.9	69.5	71.9
BBH	71.3	49.7	57.9	63.9
ARC-C	60.8	45.9	59.8	51.5
Hellaswag	76.6	70.5	73.3	79.0
Mathematics & Science
GSM8K	81.9	70.3	77.2	78.9
Math	38.4	31.4	39.6	41.1
Code
HumanEval	50.0	35.4	57.9	56.7
MBPP	57.8	40.0	56.2	63.6
Average	63.9	51.1	61.4	63.3
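The averages and the BBH gain can be checked against the table with a few lines of arithmetic. This is a minimal sketch that assumes the reported average is the unweighted mean of the eight benchmarks listed; the article does not spell out the averaging method, and the dictionary layout and function names are illustrative.

```python
from statistics import mean

# Base-model scores as listed in the table (same benchmark order for every model).
BENCHMARKS = ["MMLU", "BBH", "ARC-C", "Hellaswag", "GSM8K", "Math", "HumanEval", "MBPP"]
SCORES = {
    "iLLaDA 8B": [74.8, 71.3, 60.8, 76.6, 81.9, 38.4, 50.0, 57.8],
    "LLaDA 8B": [65.9, 49.7, 45.9, 70.5, 70.3, 31.4, 35.4, 40.0],
    "Dream 7B": [69.5, 57.9, 59.8, 73.3, 77.2, 39.6, 57.9, 56.2],
    "Qwen2.5 7B": [71.9, 63.9, 51.5, 79.0, 78.9, 41.1, 56.7, 63.6],
}
# Averages as reported in the table.
REPORTED_AVG = {"iLLaDA 8B": 63.9, "LLaDA 8B": 51.1, "Dream 7B": 61.4, "Qwen2.5 7B": 63.3}


def unweighted_average(model):
    """Assumption: every benchmark counts equally toward the average."""
    return mean(SCORES[model])


def gain(benchmark, newer, older):
    """Point difference between two models on one benchmark."""
    idx = BENCHMARKS.index(benchmark)
    return SCORES[newer][idx] - SCORES[older][idx]


if __name__ == "__main__":
    for model in SCORES:
        print(f"{model}: computed {unweighted_average(model):.3f}, reported {REPORTED_AVG[model]}")
    print(f"BBH gain over LLaDA: {gain('BBH', 'iLLaDA 8B', 'LLaDA 8B'):.1f}")
```

Under that assumption the computed means agree with the reported averages to within rounding, and the BBH difference between iLLaDA and LLaDA comes out at the 21.6 points quoted in the text.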

The comparison with the competing diffusion model Dream 7B also favors iLLaDA. Dream wasn't trained from scratch but fine-tuned from an existing Qwen2.5 checkpoint. iLLaDA still beats Dream on average, 63.9 vs. 61.4, even without the head start of a strong autoregressive base. Dream only holds a slight edge on coding benchmarks.

A gap remains at the instruct level. iLLaDA-Instruct scores 67.1 points while Qwen2.5 7B Instruct hits 77.1, with math and code driving most of the difference. The authors blame this on the extra reinforcement learning alignment in Qwen2.5, which iLLaDA lacks. In the paper's appendix, they also note that the model can get stuck in reasoning loops on harder tasks.

