{"slug": "dflash-speculative-decoding-drafts-whole-token-blocks-in-parallel-for-up-to-15x", "title": "DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell", "summary": "UC San Diego researchers developed DFlash, a speculative decoding method that uses a lightweight block diffusion model to draft entire token blocks in parallel. The paper reports up to 6.08x lossless speedup on Qwen3-8B, and NVIDIA reports up to 15x higher throughput on Blackwell at fixed interactivity. The technique replaces autoregressive drafting with block diffusion drafting conditioned on the target model's hidden features through KV injection, and is supported by SGLang, vLLM, and TensorRT-LLM.", "body_md": "UC San Diego's DFlash replaces autoregressive drafting with a lightweight block diffusion model for speculative decoding. It drafts whole token blocks in a single forward pass and conditions on target hidden features through KV injection. The paper reports up to 6.08x lossless speedup on Qwen3-8B, while NVIDIA reports up to 15x higher throughput on Blackwell at fixed interactivity. DFlash ships 20 checkpoints and supports SGLang, vLLM, and TensorRT-LLM.\n\nThe post [DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell](https://www.marktechpost.com/2026/06/24/dflash-speculative-decoding-drafts-whole-token-blocks-in-parallel-for-up-to-15x-higher-throughput-on-nvidia-blackwell/) appeared first on [MarkTechPost](https://www.marktechpost.com).", "url": "https://wpnews.pro/news/dflash-speculative-decoding-drafts-whole-token-blocks-in-parallel-for-up-to-15x", "canonical_source": "https://www.marktechpost.com/2026/06/24/dflash-speculative-decoding-drafts-whole-token-blocks-in-parallel-for-up-to-15x-higher-throughput-on-nvidia-blackwell/", "published_at": "2026-06-24 07:21:10+00:00", "updated_at": "2026-06-24 07:48:01.073817+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "ai-research", "ai-infrastructure"], "entities": ["UC San Diego", "NVIDIA", "Qwen3-8B", "SGLang", "vLLM",
"TensorRT-LLM", "Blackwell"], "alternates": {"html": "https://wpnews.pro/news/dflash-speculative-decoding-drafts-whole-token-blocks-in-parallel-for-up-to-15x", "markdown": "https://wpnews.pro/news/dflash-speculative-decoding-drafts-whole-token-blocks-in-parallel-for-up-to-15x.md", "text": "https://wpnews.pro/news/dflash-speculative-decoding-drafts-whole-token-blocks-in-parallel-for-up-to-15x.txt", "jsonld": "https://wpnews.pro/news/dflash-speculative-decoding-drafts-whole-token-blocks-in-parallel-for-up-to-15x.jsonld"}}