DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell


Source: marktechpost.com · Published: 2026-06-24T07:21Z · Topic: large-language-models

UC San Diego researchers developed DFlash, a speculative decoding method that swaps the usual autoregressive drafter for a lightweight block diffusion model, which drafts entire token blocks in parallel and is conditioned on the target model through KV injection. The paper reports up to 6.08x lossless speedup on Qwen3-8B, and NVIDIA reports up to 15x higher throughput on Blackwell at fixed interactivity. DFlash is supported by SGLang, vLLM, and TensorRT-LLM.


UC San Diego's DFlash replaces autoregressive drafting with a lightweight block diffusion model for speculative decoding. The drafter proposes a whole block of tokens in a single forward pass and conditions on the target model's hidden features through KV injection. The paper reports up to 6.08x lossless speedup on Qwen3-8B, while NVIDIA reports up to 15x higher throughput on Blackwell at fixed interactivity. DFlash ships 20 checkpoints and supports SGLang, vLLM, and TensorRT-LLM.
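To make the "draft a block, then verify" idea concrete, here is a minimal Python sketch of the standard greedy verification step used in speculative decoding generally. It is not DFlash's code: the article does not describe the acceptance rule, so the greedy rule below is an assumption, and every name (`verify_block`, `speculative_generate`, `toy_target_next`, `toy_draft_block`) is hypothetical. The toy drafter builds its block with a loop for simplicity, whereas DFlash's drafter is described as producing the block in one forward pass.

```python
from typing import Callable, List, Tuple


def verify_block(draft: List[int], target_greedy: List[int]) -> List[int]:
    """Greedy verification (assumed rule, not taken from the DFlash paper).

    draft:         k tokens proposed by the drafter.
    target_greedy: k + 1 tokens, where target_greedy[i] is the target's greedy
                   choice given the prefix followed by draft[:i].
    Returns the accepted draft prefix plus one token chosen by the target.
    """
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_greedy[i]:
            break  # first disagreement: discard the rest of the block
        accepted.append(tok)
    # The target supplies the correction (or a bonus token if all matched).
    accepted.append(target_greedy[len(accepted)])
    return accepted


def speculative_generate(
    prefix: List[int],
    target_next: Callable[[List[int]], int],
    draft_block: Callable[[List[int], int], List[int]],
    block_size: int,
    max_new: int,
) -> Tuple[List[int], int]:
    """Generate max_new tokens; returns (tokens, number of verify steps)."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_new:
        draft = draft_block(out, block_size)
        # A real target scores all k + 1 positions in one forward pass;
        # this toy version simply calls the target once per position.
        target_greedy = [target_next(out + draft[:i]) for i in range(block_size + 1)]
        out.extend(verify_block(draft, target_greedy))
        steps += 1
    return out[: len(prefix) + max_new], steps


def greedy_generate(prefix: List[int], target_next, max_new: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used as the reference."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def toy_target_next(tokens: List[int]) -> int:
    """Hypothetical stand-in for the target model (needs a non-empty prefix)."""
    return (tokens[-1] * 3 + 1) % 17


def toy_draft_block(tokens: List[int], k: int) -> List[int]:
    """Hypothetical drafter: mostly agrees with the target, sometimes wrong."""
    block, ctx = [], list(tokens)
    for _ in range(k):
        guess = toy_target_next(ctx)
        if len(ctx) % 4 == 0:
            guess = (guess + 1) % 17  # deliberate mistake to exercise rejection
        block.append(guess)
        ctx.append(guess)
    return block


if __name__ == "__main__":
    spec, steps = speculative_generate([1], toy_target_next, toy_draft_block, 4, 12)
    print(spec == greedy_generate([1], toy_target_next, 12), steps)
```

Under this rule every emitted token is one the target would have chosen itself, which is the sense in which such a speedup is "lossless": the output matches plain greedy decoding, and the gain comes from needing fewer target verification steps than tokens generated.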

The post DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell appeared first on MarkTechPost.


Read original on marktechpost.com → www.marktechpost.com/2026/06/24/dflash-speculati…
