{"slug": "bytedance-s-illada-is-a-diffusion-language-model-that-keeps-up-with-qwen2-5", "title": "ByteDance's \"iLLaDA\" is a diffusion language model that keeps up with Qwen2.5", "summary": "ByteDance and Renmin University researchers released iLLaDA, an 8B diffusion language model that matches Qwen2.5 on base benchmarks but lags after fine-tuning. The model, trained from scratch on 12 trillion tokens, outperforms prior diffusion models like LLaDA and Dream 7B, though its instruct version trails Qwen2.5 due to lack of reinforcement learning alignment.", "body_md": "ByteDance's \"iLLaDA\" is a diffusion language model that keeps up with Qwen2.5\n\n**Researchers from Renmin University and ByteDance have released iLLaDA, an 8B language model that works differently from ChatGPT. It matches Qwen2.5 at the base level but falls behind after fine-tuning.**\n\nNearly all well-known AI language models, such as GPT, Claude, or Qwen, generate text autoregressively: token by token, left to right, with each new token depending only on the ones before it.\n\nDiffusion language models take a different approach. They start with a sequence of placeholders, called masked tokens, and refine them across multiple passes in parallel, similar to how image models shape a picture from noise. Every position can attend to every other position at the same time, making the process bidirectional.\n\niLLaDA is part of a broader movement that includes Google. In June 2026, Google DeepMind released [DiffusionGemma](https://the-decoder.com/googles-new-open-model-diffusiongemma-generates-text-from-noise-instead-of-word-by-word/). That model generates text about four times faster via diffusion but scores worse on benchmarks like MMLU and code than the similarly sized autoregressive [Gemma 4](https://the-decoder.com/googles-gemma-4-is-now-available-with-apache-2-0-licensing-for-the-first-time/). 
Google recommends it for low-latency use cases, not quality-critical production.\n\nThe two models pursue opposite goals. DiffusionGemma is built on the Gemma 4 backbone, a 25-billion-parameter mixture-of-experts model, and swaps only the generation method to prioritize speed. iLLaDA, short for \"improved LLaDA,\" goes the other way: it's a dense 8B model trained from scratch, focused on quality.\n\nThe question behind all of this is whether a diffusion model built from the ground up can actually keep up with autoregressive models. A direct numerical comparison between the two is tough, though. Google uses partly different, harder benchmark variants, and DiffusionGemma plays in a different weight class.\n\n## What iLLaDA can do\n\nThe team pretrained iLLaDA on 12 trillion tokens, up from 2.3 trillion for its predecessor LLaDA, and fine-tuned it for twelve epochs. According to the paper, iLLaDA-Base improves sharply over LLaDA, jumping 21.6 points on the reasoning test BBH, for example. On average it hits 63.9 points, edging just past the autoregressive Qwen2.5 7B at 63.3.\n\n| | iLLaDA 8B | LLaDA 8B | Dream 7B | Qwen2.5 7B |\n|---|---|---|---|---|\n| Model | Diffusion | Diffusion | Diffusion | AR |\n| Training tokens | 12T | 2.3T | 18T + 0.6T | 18T |\n| General Tasks | | | | |\n| MMLU | 74.8 | 65.9 | 69.5 | 71.9 |\n| BBH | 71.3 | 49.7 | 57.9 | 63.9 |\n| ARC-C | 60.8 | 45.9 | 59.8 | 51.5 |\n| Hellaswag | 76.6 | 70.5 | 73.3 | 79.0 |\n| Mathematics & Science | | | | |\n| GSM8K | 81.9 | 70.3 | 77.2 | 78.9 |\n| Math | 38.4 | 31.4 | 39.6 | 41.1 |\n| Code | | | | |\n| HumanEval | 50.0 | 35.4 | 57.9 | 56.7 |\n| MBPP | 57.8 | 40.0 | 56.2 | 63.6 |\n| Average | 63.9 | 51.1 | 61.4 | 63.3 |\n\nThe comparison with the competing diffusion model Dream 7B also favors iLLaDA. Dream wasn't trained from scratch but fine-tuned from an existing Qwen2.5 checkpoint. iLLaDA still beats Dream on average, 63.9 vs. 61.4, even without the head start of a strong autoregressive base. 
Dream only holds a slight edge on coding benchmarks.\n\nA gap remains at the instruct level. iLLaDA-Instruct scores 67.1 points while Qwen2.5 7B Instruct hits 77.1, with math and code driving most of the difference. The authors blame this on the extra reinforcement learning alignment in Qwen2.5, which iLLaDA lacks. In the paper's appendix, they also note that the model can get stuck in reasoning loops on harder tasks.\n", "url": "https://wpnews.pro/news/bytedance-s-illada-is-a-diffusion-language-model-that-keeps-up-with-qwen2-5", "canonical_source": "https://the-decoder.com/bytedances-illada-is-a-diffusion-language-model-that-keeps-up-with-qwen2-5/", "published_at": "2026-06-27 07:48:29+00:00", "updated_at": "2026-06-27 08:05:53.010713+00:00", "lang": "en", "topics": ["large-language-models", "generative-ai", "ai-research"], "entities": ["ByteDance", "Renmin University", "iLLaDA", "Qwen2.5", "Google DeepMind", "DiffusionGemma", "LLaDA", "Dream 7B"], "alternates": {"html": "https://wpnews.pro/news/bytedance-s-illada-is-a-diffusion-language-model-that-keeps-up-with-qwen2-5", "markdown": "https://wpnews.pro/news/bytedance-s-illada-is-a-diffusion-language-model-that-keeps-up-with-qwen2-5.md", "text": 
"https://wpnews.pro/news/bytedance-s-illada-is-a-diffusion-language-model-that-keeps-up-with-qwen2-5.txt", "jsonld": "https://wpnews.pro/news/bytedance-s-illada-is-a-diffusion-language-model-that-keeps-up-with-qwen2-5.jsonld"}}