JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

wpnews.pro

[ARTICLE · art-32072] src=arxiv.org ↗ pub=2026-06-18T04:00Z topic=large-language-models verified=true sentiment=↑ positive

Researchers from Hao AI Lab introduced JetFlow, a speculative decoding framework for autoregressive LLMs that breaks the scaling ceiling of speculative decoding by combining one-forward drafting efficiency with branch-wise causal conditioning. JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads using Qwen3 models on H100 GPUs, and it outperforms bidirectional-head and tree-based baselines across math, coding, and chat benchmarks.

read 1 min · published Jun 18, 2026

arXiv:2606.18394v1

Abstract: Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance.

We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup.

Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetFlow.
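To make the "accepted prefix" idea in the abstract concrete, here is a minimal sketch of greedy verification over a drafted token tree. It is not JetFlow's code: `DraftNode`, `verify_tree`, and `toy_target` are hypothetical names invented for illustration, the target model is a toy function, and verification is done one position at a time for readability, whereas real systems score the whole tree in a single target forward pass.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class DraftNode:
    """One drafted token; children are candidate continuations of this branch.

    Hypothetical structure for illustration, not JetFlow's data layout.
    """
    token: int
    children: List["DraftNode"] = field(default_factory=list)


def verify_tree(
    prefix: Sequence[int],
    roots: List[DraftNode],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy tree verification (simplified, sequential for clarity).

    At each depth, ask the target for its next token given the prefix plus
    everything emitted so far. If a drafted candidate at that depth matches,
    the draft token is accepted and we descend into its children. Otherwise
    the target's own token is emitted as the correction and the step ends.
    """
    emitted: List[int] = []
    candidates = roots
    while True:
        want = target_next(list(prefix) + emitted)
        emitted.append(want)  # the target's token is always safe to emit
        match = next((n for n in candidates if n.token == want), None)
        if match is None:
            return emitted  # no drafted branch continues this path
        candidates = match.children


def toy_target(context: List[int]) -> int:
    """Stand-in for a target model: next token is (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


# Draft tree after prefix [3]:
#   4 -> 5 -> 9
#   4 -> 6
#   8
tree = [
    DraftNode(4, [DraftNode(5, [DraftNode(9)]), DraftNode(6)]),
    DraftNode(8),
]

emitted = verify_tree([3], tree, toy_target)
print(emitted)  # [4, 5, 6]: two draft tokens accepted, plus one target token
```

In this toy run the branch `4 -> 5` is consistent with the target, so two drafted tokens are accepted before the mismatch at `9`. A tree whose deeper nodes are not conditioned on their own branch would tend to fail earlier, which is the wasted-budget effect the abstract attributes to branch-agnostic drafters.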

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.18394
