ByteDance's "iLLaDA" is a diffusion language model that keeps up with Qwen2.5

ByteDance and Renmin University researchers released iLLaDA, an 8B diffusion language model that matches Qwen2.5 on base benchmarks but lags after fine-tuning. The model, trained from scratch on 12 trillion tokens, outperforms prior diffusion models like LLaDA and Dream 7B, though its instruct version trails Qwen2.5 due to lack of reinforcement learning alignment.

Researchers from Renmin University and ByteDance have released iLLaDA, an 8B language model that works differently from ChatGPT. It matches Qwen2.5 at the base level but falls behind after fine-tuning.

Nearly all well-known AI language models like GPT, Claude, or Qwen generate text autoregressively: word by word, left to right, with each new token depending only on the ones before it. Diffusion language models take a different approach. They start with a sequence of placeholders, called masked tokens, and refine them across multiple passes in parallel. It's similar to how image models shape a picture from noise. Every position can attend to every other position at the same time, making the process bidirectional.

iLLaDA is part of a broader movement that includes Google. In June 2026, Google DeepMind released DiffusionGemma (https://the-decoder.com/googles-new-open-model-diffusiongemma-generates-text-from-noise-instead-of-word-by-word/). That model generates text about four times faster via diffusion but scores worse on benchmarks like MMLU and code than the similarly sized autoregressive Gemma 4 (https://the-decoder.com/googles-gemma-4-is-now-available-with-apache-2-0-licensing-for-the-first-time/). Google recommends it for low-latency use cases, not quality-critical production.

The two projects start from opposite ends. DiffusionGemma is built on the Gemma 4 backbone, a 25-billion-parameter mixture-of-experts model, and swaps only the generation method to prioritize speed. iLLaDA, short for "improved LLaDA," goes the other way. It's a dense 8B model trained from scratch, focused on quality. The question behind all of this is whether a diffusion model built from the ground up can actually keep up with autoregressive models.

A direct numerical comparison between the two is tough, though. Google uses partly different and harder benchmark variants, and DiffusionGemma plays in a different weight class.
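To make the idea of refining masked tokens over several parallel passes more concrete, here is a minimal toy sketch in Python. It is not iLLaDA's sampler: `toy_denoiser` is a hypothetical stand-in that proposes random tokens with random confidence scores, and the rule of committing the most confident proposals at each step is an assumption chosen for illustration. Only the overall shape matches the description above: start fully masked, let every position be proposed at once, then fill in a share of positions per pass.

```python
import random

MASK = "<mask>"


def toy_denoiser(tokens, vocab, rng):
    """Hypothetical stand-in for a trained model.

    Proposes a (token, confidence) pair for every masked position at once.
    A real model would condition on all positions bidirectionally;
    this toy version just draws random values.
    """
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def diffusion_decode(length, steps, vocab, seed=0):
    """Fill a fully masked sequence over at most `steps` passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        proposals = toy_denoiser(tokens, vocab, rng)
        # Commit an equal share of the remaining masked positions per pass
        # (ceiling division), so the final pass leaves nothing masked.
        k = -(-len(masked) // (steps - step))
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:k]
        for i in most_confident:
            tokens[i] = proposals[i][0]
    return tokens


if __name__ == "__main__":
    print(diffusion_decode(length=8, steps=4, vocab=["the", "cat", "sat"]))
```

With `length=8` and `steps=4`, each pass commits two positions. An autoregressive model would instead need eight sequential steps for the same sequence, which is where the potential speed advantage of diffusion decoding comes from.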
What iLLaDA can do

The team pretrained iLLaDA on 12 trillion tokens, up from 2.3 trillion for its predecessor LLaDA, and fine-tuned it for twelve epochs. According to the paper, iLLaDA-Base improves sharply over LLaDA, jumping 21.6 points on the reasoning test BBH, for example. On average it hits 63.9 points, edging just past the autoregressive Qwen2.5 7B at 63.3.

| | iLLaDA 8B | LLaDA 8B | Dream 7B | Qwen2.5 7B |
|---|---|---|---|---|
| Model | Diffusion | Diffusion | Diffusion | AR |
| Training tokens | 12T | 2.3T | 18T + 0.6T | 18T |
| General Tasks | | | | |
| MMLU | 74.8 | 65.9 | 69.5 | 71.9 |
| BBH | 71.3 | 49.7 | 57.9 | 63.9 |
| ARC-C | 60.8 | 45.9 | 59.8 | 51.5 |
| Hellaswag | 76.6 | 70.5 | 73.3 | 79.0 |
| Mathematics & Science | | | | |
| GSM8K | 81.9 | 70.3 | 77.2 | 78.9 |
| Math | 38.4 | 31.4 | 39.6 | 41.1 |
| Code | | | | |
| HumanEval | 50.0 | 35.4 | 57.9 | 56.7 |
| MBPP | 57.8 | 40.0 | 56.2 | 63.6 |
| Average | 63.9 | 51.1 | 61.4 | 63.3 |

The comparison with the competing diffusion model Dream 7B also favors iLLaDA. Dream wasn't trained from scratch but fine-tuned from an existing Qwen2.5 checkpoint. iLLaDA still beats Dream on average, 63.9 vs. 61.4, even without the head start of a strong autoregressive base. Dream only holds a slight edge on coding benchmarks.

A gap remains at the instruct level. iLLaDA-Instruct scores 67.1 points while Qwen2.5 7B Instruct hits 77.1, with math and code driving most of the difference. The authors blame this on the extra reinforcement learning alignment in Qwen2.5, which iLLaDA lacks. In the paper's appendix, they also note that the model can get stuck in reasoning loops on harder tasks.
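The averages and the BBH jump quoted for the base models can be checked from the eight per-benchmark scores in the table. The short Python sketch below does that. It assumes the average is a plain unweighted mean over the eight benchmarks, which the article does not state explicitly; the variable names are illustrative.

```python
# Per-benchmark scores as listed in the table, in this order:
BENCHMARKS = ["MMLU", "BBH", "ARC-C", "Hellaswag", "GSM8K", "Math", "HumanEval", "MBPP"]

scores = {
    "iLLaDA 8B":  [74.8, 71.3, 60.8, 76.6, 81.9, 38.4, 50.0, 57.8],
    "LLaDA 8B":   [65.9, 49.7, 45.9, 70.5, 70.3, 31.4, 35.4, 40.0],
    "Dream 7B":   [69.5, 57.9, 59.8, 73.3, 77.2, 39.6, 57.9, 56.2],
    "Qwen2.5 7B": [71.9, 63.9, 51.5, 79.0, 78.9, 41.1, 56.7, 63.6],
}

# Assumption: the reported average is an unweighted mean over all eight tasks.
averages = {model: sum(vals) / len(vals) for model, vals in scores.items()}

# Gain of iLLaDA over its predecessor LLaDA on BBH.
bbh = BENCHMARKS.index("BBH")
bbh_gain = scores["iLLaDA 8B"][bbh] - scores["LLaDA 8B"][bbh]

if __name__ == "__main__":
    for model, avg in averages.items():
        print(f"{model}: {avg:.2f}")
    print(f"BBH gain over LLaDA: {bbh_gain:.1f}")
```

Under that assumption, the computed means agree with the reported averages to within rounding of the table values, and the BBH difference comes out at 21.6 points.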