DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

UC San Diego researchers developed DFlash, a speculative decoding method that uses a lightweight block diffusion model to draft entire token blocks in parallel. The paper reports up to 6.08x speedup on Qwen3-8B, and NVIDIA reports up to 15x higher throughput on Blackwell. The technique replaces autoregressive drafting with block diffusion drafting, conditions the drafter on the target model through KV injection, and is supported by SGLang, vLLM, and TensorRT-LLM.

UC San Diego's DFlash replaces autoregressive drafting with a lightweight block diffusion model for speculative decoding. It drafts whole token blocks in a single forward pass and conditions on target hidden features through KV injection. The paper reports up to 6.08x lossless speedup on Qwen3-8B, while NVIDIA reports up to 15x throughput on Blackwell at fixed interactivity. DFlash ships 20 checkpoints and supports SGLang, vLLM, and TensorRT-LLM. Originally published on MarkTechPost: https://www.marktechpost.com/2026/06/24/dflash-speculative-decoding-drafts-whole-token-blocks-in-parallel-for-up-to-15x-higher-throughput-on-nvidia-blackwell/
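
To make the draft-then-verify idea concrete, here is a minimal Python sketch of block-wise speculative decoding with greedy verification. It is not DFlash's implementation: `target_next`, `draft_block`, and `speculative_generate` are hypothetical toy functions invented for illustration. The drafter here is a deliberately imperfect stand-in rather than a block diffusion model, and KV injection is not modeled at all. The sketch only shows why accepting the longest prefix of a drafted block that matches the target keeps the output identical to plain greedy decoding while needing fewer target passes.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the target model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_block(prefix: List[int], block_size: int) -> List[int]:
    """Hypothetical stand-in for a block drafter.

    A real block drafter would propose all block_size tokens in one forward
    pass; this toy version just imitates the target and injects an error at
    every 7th sequence position so that some drafts get rejected.
    """
    ctx = list(prefix)
    block = []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 7 == 6:  # deliberate drafting mistake
            tok = (tok + 1) % VOCAB
        block.append(tok)
        ctx.append(tok)
    return block


def greedy_generate(prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def speculative_generate(
    prompt: List[int], max_new_tokens: int, block_size: int = 4
) -> Tuple[List[int], int]:
    """Draft a block, verify it against the target, keep the matching prefix.

    Returns the generated sequence and the number of verification rounds.
    Each round stands for one target forward pass over the whole block; the
    per-position target_next calls below simulate that single pass.
    """
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        block = draft_block(out, block_size)
        target_passes += 1
        accepted: List[int] = []
        for tok in block:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)  # draft token matches the target
            else:
                accepted.append(expected)  # replace with the target's token
                break  # discard the rest of the block
        else:
            # Whole block accepted: the target pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_generate(prompt, 20)
    spec, passes = speculative_generate(prompt, 20, block_size=4)
    print("identical to greedy:", spec == baseline)
    print("verification rounds:", passes, "vs 20 sequential target steps")
```

Because every kept token is either confirmed or supplied by the target, the result matches greedy decoding exactly, which is the sense in which speculative decoding is called lossless. The speedup in this toy comes only from how many drafted tokens are accepted per round; it says nothing about the real-world figures quoted above.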