[ARTICLE · art-41650] src=the-decoder.com ↗ pub= topic=large-language-models verified=true sentiment=· neutral

ByteDance's "iLLaDA" is a diffusion language model that keeps up with Qwen2.5

ByteDance and Renmin University researchers released iLLaDA, an 8B diffusion language model that matches Qwen2.5 on base benchmarks but lags after fine-tuning. The model, trained from scratch on 12 trillion tokens, outperforms prior diffusion models like LLaDA and Dream 7B, though its instruct version trails Qwen2.5 due to lack of reinforcement learning alignment.

Published Jun 27, 2026 · 4 min read

Researchers from Renmin University and ByteDance have released iLLaDA, an 8B language model that works differently from ChatGPT. It matches Qwen2.5 at the base level but falls behind after fine-tuning.

Nearly all well-known AI language models like GPT, Claude, or Qwen generate text autoregressively: token by token, left to right, with each new token depending only on the ones before it.

Diffusion language models take a different approach. They start with a sequence of placeholders, called masked tokens, and refine them across multiple passes in parallel. It's similar to how image models shape a picture from noise. Every position can attend to every other position at the same time, making the process bidirectional.
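The contrast between the two generation styles can be sketched in a few lines of Python. This is a toy illustration, not iLLaDA's sampler, which the article does not describe: `toy_predict` and `TARGET` are hypothetical stand-ins for a trained model, and committing the highest-confidence predictions on each pass is one common choice for masked diffusion decoders, assumed here for illustration.

```python
import random

MASK = "<mask>"
# Hypothetical "correct" output that the stand-in model always predicts.
TARGET = ["diffusion", "models", "fill", "in", "all", "positions", "in", "parallel"]


def toy_predict(seq, rng):
    """Hypothetical stand-in for a trained model.

    Returns {position: (token, confidence)} for every masked position.
    A real model would compute both from the whole sequence; here the token
    comes from TARGET and the confidence is a random number.
    """
    return {i: (TARGET[i], rng.random()) for i, tok in enumerate(seq) if tok == MASK}


def autoregressive_decode(length):
    """Left to right: one new token per step, conditioned only on the prefix."""
    seq = []
    for i in range(length):
        seq.append(TARGET[i])  # stand-in for sampling from p(token | prefix)
    return seq, length  # one model call per generated token


def masked_diffusion_decode(length, steps, seed=0):
    """Start fully masked; each pass commits the most confident predictions."""
    rng = random.Random(seed)
    seq = [MASK] * length
    per_step = -(-length // steps)  # ceiling division: tokens committed per pass
    passes = 0
    while MASK in seq:
        preds = toy_predict(seq, rng)  # predictions for all masked positions at once
        # Keep the highest-confidence predictions, leave the rest masked.
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
        for i in best:
            seq[i] = preds[i][0]
        passes += 1
    return seq, passes


if __name__ == "__main__":
    print(autoregressive_decode(len(TARGET)))
    print(masked_diffusion_decode(len(TARGET), steps=4))
```

In this toy setup the autoregressive loop needs one step per token, while the masked loop fills the same eight positions in four passes because it commits two positions per pass. The order in which positions get filled is not left to right; it depends on the confidence values.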

iLLaDA is part of a broader movement that includes Google. In June 2026, Google DeepMind released DiffusionGemma. That model generates text about four times faster via diffusion but scores worse on benchmarks like MMLU and code than the similarly sized autoregressive Gemma 4. Google recommends it for low-latency use cases, not quality-critical production.

DiffusionGemma and iLLaDA take opposite routes. DiffusionGemma is built on the Gemma 4 backbone, a 25-billion-parameter mixture-of-experts model, and swaps only the generation method to prioritize speed. iLLaDA, short for "improved LLaDA," is a dense 8B model trained from scratch and focused on quality.

The question behind all of this is whether a diffusion model built from the ground up can actually keep up with autoregressive models. A direct numerical comparison between the two is tough, though. Google uses partly different and harder benchmark variants, and DiffusionGemma plays in a different weight class.

What iLLaDA can do

The team pretrained iLLaDA on 12 trillion tokens, up from 2.3 trillion for its predecessor LLaDA, and fine-tuned it for twelve epochs. According to the paper, iLLaDA-Base improves sharply over LLaDA, jumping 21.6 points on the reasoning test BBH, for example. On average it hits 63.9 points, edging just past the autoregressive Qwen2.5 7B at 63.3.

Benchmark         iLLaDA 8B   LLaDA 8B    Dream 7B     Qwen2.5 7B
Model type        Diffusion   Diffusion   Diffusion    AR
Training tokens   12T         2.3T        18T + 0.6T   18T

General Tasks
MMLU              74.8        65.9        69.5         71.9
BBH               71.3        49.7        57.9         63.9
ARC-C             60.8        45.9        59.8         51.5
Hellaswag         76.6        70.5        73.3         79.0

Mathematics & Science
GSM8K             81.9        70.3        77.2         78.9
Math              38.4        31.4        39.6         41.1

Code
HumanEval         50.0        35.4        57.9         56.7
MBPP              57.8        40.0        56.2         63.6

Average           63.9        51.1        61.4         63.3
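The averages and gaps quoted in the text can be recomputed from the per-benchmark scores in the table. The sketch below assumes the average is a plain unweighted mean over the eight benchmarks, which the article does not state explicitly; the helper names `average`, `gap`, and `leads` are made up for this example.

```python
# Scores copied from the table above (base models).
SCORES = {
    "iLLaDA 8B": {"MMLU": 74.8, "BBH": 71.3, "ARC-C": 60.8, "Hellaswag": 76.6,
                  "GSM8K": 81.9, "Math": 38.4, "HumanEval": 50.0, "MBPP": 57.8},
    "LLaDA 8B": {"MMLU": 65.9, "BBH": 49.7, "ARC-C": 45.9, "Hellaswag": 70.5,
                 "GSM8K": 70.3, "Math": 31.4, "HumanEval": 35.4, "MBPP": 40.0},
    "Dream 7B": {"MMLU": 69.5, "BBH": 57.9, "ARC-C": 59.8, "Hellaswag": 73.3,
                 "GSM8K": 77.2, "Math": 39.6, "HumanEval": 57.9, "MBPP": 56.2},
    "Qwen2.5 7B": {"MMLU": 71.9, "BBH": 63.9, "ARC-C": 51.5, "Hellaswag": 79.0,
                   "GSM8K": 78.9, "Math": 41.1, "HumanEval": 56.7, "MBPP": 63.6},
}


def average(model):
    """Unweighted mean over all benchmarks (assumed aggregation)."""
    vals = SCORES[model].values()
    return sum(vals) / len(vals)


def gap(benchmark, a, b):
    """Score of model a minus score of model b, rounded to one decimal."""
    return round(SCORES[a][benchmark] - SCORES[b][benchmark], 1)


def leads(a, b):
    """Benchmarks on which model a scores strictly higher than model b."""
    return [name for name in SCORES[a] if SCORES[a][name] > SCORES[b][name]]


if __name__ == "__main__":
    for model in SCORES:
        print(f"{model:<11} {average(model):.2f}")
    print("BBH gain over LLaDA:", gap("BBH", "iLLaDA 8B", "LLaDA 8B"))
    print("Dream ahead of iLLaDA on:", leads("Dream 7B", "iLLaDA 8B"))
```

Under that assumption the recomputed means agree with the reported averages to within rounding of the published one-decimal scores, and the BBH difference between iLLaDA and LLaDA comes out at the 21.6 points cited above.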

The comparison with the competing diffusion model Dream 7B also favors iLLaDA. Dream wasn't trained from scratch but fine-tuned from an existing Qwen2.5 checkpoint. iLLaDA still beats Dream on average, 63.9 vs. 61.4, even without the head start of a strong autoregressive base. Dream leads only on the coding benchmark HumanEval (57.9 vs. 50.0) and, narrowly, on Math (39.6 vs. 38.4); on the second coding benchmark, MBPP, iLLaDA is ahead.

A gap remains at the instruct level. iLLaDA-Instruct scores 67.1 points while Qwen2.5 7B Instruct hits 77.1, with math and code driving most of the difference. The authors blame this on the extra reinforcement learning alignment in Qwen2.5, which iLLaDA lacks. In the paper's appendix, they also note that the model can get stuck in reasoning loops on harder tasks.
