{"slug": "jetspec", "title": "JetSpec", "summary": "JetSpec, a new speculative decoding method, trains a causal parallel draft head over fused hidden states from a frozen target model, enabling lossless verification of candidate trees in one forward pass. On Qwen3-8B with budget 256, JetSpec achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended chat, translating to around 1000 tokens per second throughput on a single B200 GPU.", "body_md": "**TL;DR**: Speculative decoding hits a scaling ceiling: a larger draft budget helps only while acceptance stays high and drafting stays cheap. Prior draft heads face a dilemma: autoregressive drafters condition on each path but pay with tree depth, while block-diffusion drafters draft in one pass but score branches independently, creating plausible yet mutually inconsistent trees.\n\n**JetSpec** trains a **causal parallel draft head** over fused hidden states from a frozen target model, so candidate-tree scores follow the target’s own autoregressive factorization. The frozen target then verifies the full tree in one forward pass, losslessly. On Qwen3-8B with greedy decoding and a draft budget of 256, JetSpec reaches a 9.64x speedup on MATH-500 and 4.58x on open-ended chat, and these gains carry into real single-stream serving on JetSpec’s own engine, averaging around 1000 tokens per second (TPS) on MATH-500 with a single B200 GPU.", "url": "https://wpnews.pro/news/jetspec", "canonical_source": "https://haoailab.com/summary/parallel-tree-drafting/", "published_at": "2026-06-22 20:00:00+00:00", "updated_at": "2026-06-25 06:54:06.117415+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence", "ai-infrastructure", "ai-research"], "entities": ["JetSpec", "Qwen3-8B", "MATH-500", "B200 GPU"], "alternates": {"html": "https://wpnews.pro/news/jetspec", "markdown": "https://wpnews.pro/news/jetspec.md", "text": "https://wpnews.pro/news/jetspec.txt", "jsonld": "https://wpnews.pro/news/jetspec.jsonld"}}