Speculative decoding: when and why it actually speeds up inference

A team running a 70B Llama 3 fine-tune at 200 requests per second cut median time-to-first-token from 380 ms to 140 ms on the same hardware by implementing speculative decoding. The technique addresses memory-bound GPU utilization by running a small draft model to propose several tokens per cycle, which the target model then verifies in a single forward pass. The acceleration is provably exact (the output distribution is identical to running the target model alone), but it only pays off when the draft model is closely aligned with the target; otherwise the draft computation is wasted.

Your chat endpoint serves 200 requests per second. The model is a 70B Llama 3 fine-tune. The GPU is sitting at 78% utilization, but the user-facing latency is still bad: 380 ms to first token on the median request, 1.1 s at P99. The naive read is "we need a bigger box." The actual read is that the GPU is memory-bound, not compute-bound: most of the time is spent shipping weights and KV-cache state from HBM into the SMs, one token at a time, before the next token can start.

Speculative decoding is the technique that turns that one-token-at-a-time pipeline into a several-tokens-at-a-time pipeline without changing what the model actually samples. In our case it dropped p50 TTFT from 380 ms to 140 ms with the same hardware and the same 70B weights. Here's what it is, what the variants are, and when it stops being a free lunch.

The throughput ceiling for an autoregressive LLM on a single GPU is set by the cost of streaming the weights and KV-cache state from HBM for every generated token, not by FLOPs. Doubling the model's parameters roughly doubles the time per token on a memory-bound workload, because twice as many bytes have to move for each token, while most of the SMs' compute capacity sits idle.
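To make the memory-bound argument concrete, here is a back-of-the-envelope sketch in Python. It is only a rough lower bound: it assumes 16-bit weights, ignores KV-cache traffic and all overheads, and uses a purely hypothetical aggregate HBM bandwidth. None of these numbers are measurements from the deployment described above, and the function name is made up for illustration.

```python
def min_seconds_per_token(n_params: float,
                          bytes_per_param: float,
                          hbm_bytes_per_s: float) -> float:
    """Lower bound on decode time per token if every weight is read once per token.

    Ignores KV-cache reads, kernel launch overhead, and inter-GPU communication,
    so real per-token latency will be higher than this floor.
    """
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / hbm_bytes_per_s


# Illustrative assumptions only (not taken from the article's deployment):
BYTES_PER_PARAM = 2.0      # 16-bit weights
HBM_BYTES_PER_S = 16e12    # hypothetical aggregate bandwidth across the serving GPUs

t_70b = min_seconds_per_token(70e9, BYTES_PER_PARAM, HBM_BYTES_PER_S)
t_140b = min_seconds_per_token(140e9, BYTES_PER_PARAM, HBM_BYTES_PER_S)

print(f"70B floor:  {t_70b * 1e3:.2f} ms/token")
print(f"140B floor: {t_140b * 1e3:.2f} ms/token")
print(f"ratio:      {t_140b / t_70b:.1f}x")
```

Under these assumptions the floor scales linearly with parameter count, which is the doubling described above. The same arithmetic hints at why verifying several draft tokens in one target pass helps: the weights are read once for the whole batch of positions rather than once per token.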
Speculative decoding addresses this by doing the heavy forward pass over the target model only once every K tokens, and filling the gaps with a much smaller draft model that proposes K tokens in roughly the time the target would have taken to produce one.

The property people forget until it bites them: speculative decoding is an exact decoding accelerator. The output distribution is provably identical to running the target model alone, because every proposed token is verified by the target. If the target rejects a proposal, the algorithm resamples that position from a corrected distribution. If the target accepts the proposals, the cost of the target's forward pass is paid once for the whole batch instead of once per token. You don't trade output quality for speed. You trade VRAM and engineering effort for speed.

The original formulation is from DeepMind's Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper, "Accelerating Large Language Model Decoding with Speculative Sampling" (https://arxiv.org/abs/2302.01318, Feb 2023). The setup:

The number of accepted tokens per cycle is a random variable. If the draft model is well aligned with the target (close to it in distribution), the expected accepted length is high and the speedup is high. If they diverge (different tokenizer offset, different training data, different fine-tune), most proposals get rejected and you're paying the draft cost for nothing.

php flowchart LR A Prompt -- B Draft model Mq