{"slug": "the-economics-of-speculative-decoding", "title": "The economics of speculative decoding", "summary": "Speculative decoding, a lossless inference optimisation that predicts future tokens to reduce latency, faces new economic constraints as modern mixture-of-experts (MoE) architectures replace dense transformers. MoE layers increase memory-bound operations per token, widening the \"free token\" band but requiring larger batch sizes to achieve the same compute efficiency, fundamentally altering when and how far ahead speculation remains profitable.", "body_md": "# The economics of speculative decoding\n\n*The Bulls and Bears in the Market* (1879), via [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:The_Bulls_and_Bears_in_the_Market.jpeg).\n\nSpeculative decoding is one of the cleanest performance wins in inference optimisation: it’s lossless, it hits decode latency when not much else does, and in its standard formulation it’s simple and elegant.\n\nIt works by looking forwards: speculative decoding takes a position on what tokens will come next. 
For dense transformers the bet is riskless: accepted tokens pay off, rejected tokens cost nothing, a clean arbitrage on spare memory bandwidth.\n\nA burst of research activity has recently pushed the envelope on how far\nforwards we can take that bet, for example [Eagle 3.1](https://vllm.ai/blog/2026-05-26-eagle-3-1),\n[DFlash](https://arxiv.org/html/2602.06036v2),\n[SSD](https://arxiv.org/html/2603.03251v3).\n\nThis post looks at two architectural shifts that have changed the underlying economics of speculation: what mixture-of-experts routing does to the decode roofline, and how compressed attention takes away the slack that used to make speculated tokens free.\n\nThen it works through what they mean for when, and how far ahead, we should speculate.\n\n## The expert tax\n\nFFN layers in older, dense transformers (like the venerable\n[Llama](https://huggingface.co/meta-llama/Meta-Llama-3-70B) series, which I wrote about before [here](https://fergusfinn.com/blog/inference-arithmetic/)) have a\nsimple [roofline](https://en.wikipedia.org/wiki/Roofline_model) with batch size: arithmetic intensity climbs linearly with\nbatch size as weights get reused across the batch, then flattens onto the\ncompute ceiling.\n\nThe win for speculative decoding is clear. If you’re on the slope of the roofline you’re memory bound, and speculated tokens increase the amount of compute you’re doing without increasing the memory transfer. So both accepted & rejected tokens are free until they push you over the knee.\n\nModern models almost invariably (with some interesting [exceptions](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B)) use\n[mixture-of-experts](https://huggingface.co/blog/moe) (MoE) layers in place of\nsimple dense FFNs. Each token passes first through a ‘routing’ layer, which\norders the relevant experts by affinity. 
The token hidden state is sent to the\ntop experts, then the results are recombined.\n\nThis routing means that the arithmetic intensity of the MoE layer can depend on the actual content of the hidden state inputs, not just the shape. In practice, one training objective (for training and large-scale inference reasons) is to keep the experts balanced: that is, if tokens come in, each expert of total should process a fraction of the total.\n\nFrom here on, take DeepSeek-V4-Flash as an example: routed experts of , plus one always-on shared expert. The intensity-vs-batch curve changes in two ways vs. a dense equivalent.\n\n**Barely amortising at the bottom.** At small batch each new token added to the batch tends to activate fresh experts (at batch 2 the chance the new token’s experts already match is small), so it drags its own weights across the bus and gets little to no amortisation. The intensity leaves the origin at only half its eventual slope, so a token added here, speculated or not, pays close to full freight for its experts.\n\n**Shallower slope / distant knee, same ceiling.** Once every expert is being triggered, the MoE line climbs more gently, reaching the same ceiling only at a far larger batch. The free-token band is much wider.\n\nDense climbs steeply; the MoE is shallower by a factor . The shaded region under each line is the memory-bound stretch, where speculated tokens are roughly free; it runs much wider for the MoE. This assumes uniform routing to experts, which is a good assumption for DeepSeek, and single-node deployment (expert parallelism changes the picture a bit). We’re using the fp4 threshold since DeepSeek’s experts are natively MXFP4. Not visible on this plot, because of the shallowness of the MoE roofline: the curve between and ~, where new experts are being brought in.\n\nThe whole idea of speculative decoding is to amortise the weight transfer in autoregressive decoding across multiple steps. 
Notably, the chart tells us at batch size this barely works for the MoE layers. But, as batch size grows past this low region, there’s a much larger space in which speculative decoding might pay.\n\nThe implications for speculative decoding are that:\n\n- The win when speculative tokens are accepted is no longer so big.\n- The penalty when speculative tokens are rejected is no longer zero.\n- Both the win & the penalty from speculative decoding change nonlinearly with batch size.\n\n## The changing face of attention\n\nThe ‘expert tax’ at low batch size is part of the story that’s changed. The\nother part is attention. A recap: the term for the ratio of FLOPs to memory transferred\nfor an operation is *arithmetic intensity*. You can figure out whether an\noperation is memory bound or compute bound by comparing its arithmetic\nintensity to the ratio of *available* FLOPs and memory bandwidth, for the\nhardware you’ll run the operation on.\n\nGenerically, we can write the arithmetic intensity of the attention operation as:\n\nfor query tokens over context tokens, where is the (bf16) FLOPs per query-context pair and , are the bytes transferred per context and query token.\n\nFor models in the Llama-3 vein, at decode, where , this goes as . (For pure MHA, it truly goes as with no constant. Llama-3 is not quite so optimisation-blind, so it uses GQA, which makes it something like .) The ridge for a B200 is FLOPs/byte (bf16). Assuming we don’t have a speculator that can produce hundreds of correct tokens at a time (if we did, we might as well just use it in place of the target model), pretty much any reasonable number of speculation tokens you verify wrings more compute out of a KV read you had already paid for. 
This means speculation can still be a win for global throughput at high batch sizes, even when the GEMMs hit their ridgepoint, something that maybe goes underappreciated.\n\nThe trend in attention implementations, driven by the binding pressure of KV\ncache sizes, has been KV cache compression: driving down , the bytes\nstored and transferred per token in the sequence, and often correspondingly\n. One successful attention implementation, DeepSeek’s\n[Multihead Latent Attention](https://arxiv.org/abs/2412.19437) (MLA), does this by\nstoring only a single latent vector per token, for all the attention heads. (The architecture we’ve been discussing is DeepSeek-V4, which is to\n*Attention Is All You Need* MHA what [ASML’s EUV\nmachines](https://www.asml.com/en/products/euv-lithography-systems) are to\n[spirographs](https://www.amazon.co.uk/Spirograph-Including-Precision-Spiro-Putty-Creative/dp/B0B6GWVST9/).\nIts variants get a full breakdown in the [appendix](#v4-hca-and-csa). The\nupshot is the same qualitative shape as MLA, but the exact thresholds move\nwith the compression ratio and sequence length. For the calculations on MLA +\nDeepSeek’s attention variants, see the [appendix](#mla).)\n\nThe arithmetic intensity (FLOP/byte), by context length (rows) and number of query tokens (columns), is:\n\n| Context length | 1 | 2 | 4 | 8 |\n|---|---|---|---|---|\n| 512 | 193 | **322** | **484** | **645** |\n| 1,024 | 215 | **387** | **645** | **967** |\n| 8,192 | 238 | **469** | **910** | **1719** |\n| 16,384 | 240 | **476** | **938** | **1820** |\n| 1,048,576 | 242 | **483** | **967** | **1932** |\n\nCompare the bf16 ridge ( FLOP/byte). (Attention stays in bf16 even when the FFN GEMMs and the KV cache itself drop to fp8/fp4, because the softmax is more sensitive to precision.) Bold is compute-bound. Decode () is just memory-bound at every context length.\n\nAny number of speculation tokens makes MLA immediately compute bound!\n\nIt’s a little more subtle than this. 
MLA has two algebraically-equivalent formulations: an MQA one (a single latent KV shared across all heads, which is what the table assumes), used at decode, and an MHA one (the latent up-projected to per-head K/V), used in prefill. The MHA form’s attention runs at intensity rather than , so it stays memory-bound far longer, but only by up-projecting the whole KV context to per-head K/V, a fixed cost that amortises across the attending tokens and so only pays for itself past . Speculation never gets near that (we assume tokens), so we’re always in the MQA regime, where the table holds.\n\nSo there’s no free lunch. When you speculate with DeepSeek, you pay close to full price for your speculated tokens.\n\n## How to price a speculated token\n\nWe’ve talked about two different things that have changed the cost landscape for speculative decoding.\n\nWhen figuring out how well speculation is going to work *as a system*, there\nare two things that matter:\n\n- The extra cost that comes from running the draft model. This cost can come\nto bear in throughput (the FLOPs used on the draft model could have been\nused on the original model), and in latency (i.e. in the standard\nformulation the draft model has to run synchronously in the forward pass). Realistically the draft model will also have its own roofline, which adds\nstraightforwardly to the per-token marginal cost.\n[Eagle](https://vllm.ai/blog/2026-05-26-eagle-3-1)/MTP use a fast autoregressive model conditioned on the hidden states of the base model, while [DFlash](https://arxiv.org/html/2602.06036v2) uses bidirectional attention with a masked language modelling objective.\n\n- How much each token costs to verify. Accepted, we book it as profit over generating the token anew; rejected, a tax for having speculated. For a dense, memory-bound model this is roughly zero. 
That’s no longer quite true, and not just for MoE, since the compressed attention eats the same slack from the other side.\n\nIn order to choose how to build a speculation system, we need to pick parameters that balance the value we get from new tokens against the cost we pay for producing, then verifying, those tokens.\n\nThe chart tells us how much a new speculated token costs to produce + verify, relative to a new token. is break-even.\n\n## How far ahead should we speculate\n\nThe cost model tells us that we need to be careful with speculated tokens, because they’re no longer free. Speculated tokens that are expensive to verify need to be likely to be accepted, otherwise they don’t pay their way relative to tokens generated anew. To figure out how many speculated tokens to work with, we need a model of acceptance.\n\nPick the simplest speculation model: each draft token gets accepted i.i.d. with\nfixed probability , draft length . (Constant per-position is the optimistic case; real acceptance\ndecays with draft depth in some complicated way, and also depends on the\ncontent & length of the preceding sequence. So read as an upper\nbound. Drafter cost would add to ; I’m holding it fixed here.) The expected number of\ntokens committed by one verifier pass is a\nfinite [geometric series](https://en.wikipedia.org/wiki/Geometric_series), and goes like:\n\nThe cost of verifying tokens in the target model is:\n\nWriting the no-speculation decode cost as , the throughput speedup relative to ordinary decode is then:\n\nThe mental model is usually that the denominator is roughly constant with up until the model becomes compute bound. But that’s nowhere near true any more. 
At small batch sizes, there are parameter regions where you’re better off not speculating at all!\n\nOptimising the speedup for gives us everywhere that speculation is useful for DeepSeek-V4-Flash:\n\n## Conclusion\n\nThe picture of speculative decoding that I had in my head before running the maths was the one from the original paper, valid for dense transformers: speculative decoding works for small batch size, and it’s a big win up to the point at which the GEMMs are compute bound. After that point, it’s still a win, because attention’s arithmetic intensity doesn’t saturate with batch size.\n\nArchitectural innovations have rendered both of those notions false. The MoE tax can leave us with a gutter in which the optimal speculative decode length drops to zero. And MLA is compute bound with a single speculation token.\n\nSome of this maths is exaggerated by considering a regime that we don’t sit in\nat scale. If we’re serving a MoE model, the win from distributing those experts\nacross nodes (expert parallel) is big, even considered per GPU. And fewer\nexperts per GPU lowers the MoE tax, since with fewer experts per GPU, a larger\nfraction are brought in per token. On the other hand each speculated token does\nthen have to pay a *communications* tax instead, which does not get amortized.\n\nThere are a few directions to take this further.\n\nFirst, the model we’ve built of the cost of a marginal speculation token makes it easy to tune production deployments without expensive trial and error. If we want to sit at batch size and average sequence length , and we’re using MTP with an average acceptance rate , then we can just read off the value of we should set.\n\nMore interestingly, the architecture has raised the stakes on every speculation decision: rejected tokens cost more than they used to, and accepted ones pay less. With the stakes higher and the optimal moving with load, profile-guided adaptive speculation is worth far more than it once was. 
It points us towards better ways of choosing how many tokens to propose dynamically per scheduler step, and towards more sophisticated real-time decision making as to whether to run the verifier on those tokens.\n\nFor more on these ideas, watch this space.\n\n## Appendix: the roofline maths\n\nEverything here is DeepSeek-V4-Flash. Each part is the maths behind one of the charts above: the MoE roofline under the expert-stack chart, the attention roofline under the intensity table, and the marginal-cost curve under the ledger.\n\n### The MoE roofline\n\nRouting sends each token to $k$ of $E$ routed experts (, for\nDeepSeek-V4-Flash), and the block also has one always-on shared expert. A\nforward pass over $B$ tokens therefore issues $kB$ routed expert picks plus\n$B$ shared-expert applications. The compute is set by those expert applications,\nbut the routed weight traffic is set by how many *distinct* routed experts get\ntouched, since each resident routed expert is loaded once and reused by every\ntoken routed to it.\n\nThat distinct count, $D$, is the\n[coupon-collector](https://en.wikipedia.org/wiki/Coupon_collector%27s_problem)\n(occupancy) expectation. A given expert is among the $k$ a token picks\nwith probability $k/E$, so a single token misses it with probability $1 - k/E$,\nand all $B$ tokens miss it (independently) with probability $(1 - k/E)^B$. The\nexpected number activated is\n\n$$\\mathbb{E}[D] = E\\left[1 - \\left(1 - \\frac{k}{E}\\right)^{B}\\right].$$\n\nThere are two interpretable limits. For $B \\ll E/k$ expand the bracket: $\\mathbb{E}[D] \\approx kB$, every token drags in $k$ fresh experts and gets no sharing. For $B \\gg E/k$, $\\mathbb{E}[D] \\to E$, every expert is resident and the next token is free below the threshold. The crossover is the knee at $B \\approx E/k$. The marginal fresh-expert load per token is the derivative,\n\n$$\\frac{\\mathrm{d}\\,\\mathbb{E}[D]}{\\mathrm{d}B} = -E\\left(1 - \\frac{k}{E}\\right)^{B}\\ln\\left(1 - \\frac{k}{E}\\right) \\approx k\\,e^{-kB/E}.$$\n\nArithmetic intensity is FLOPs over bytes. 
The MoE GEMM does FLOPs per expert-param and loads experts’ worth of weights at bytes each ( for MXFP4), so\n\nThe B200 fp4 ridge is FLOP/byte.\n\n### The attention roofline\n\n#### MLA\n\nThe cache is a single latent of width plus a shared rope key , read\nonce per context token; the query, though, is *per head*, so each of the\nheads drags its own copy in to dot product against that latent. Hence\n\nwhere and are the bytes per element of the (fp8) cache and (bf16) query counts, per head, the score over latent-plus-rope () and the value over the latent (), doubled for FLOPs per MAC. For V4-Flash (, , , , ): bytes, bytes, FLOP/pair. The important asymmetry is then .\n\n#### V4: HCA and CSA\n\nDeepSeek-V4’s sparse attention mechanism has two variants, both based on MLA, called “Heavily Compressed Attention” (HCA) and “Compressed Sparse Attention” (CSA). They alternate in the backbones of DeepSeek-V4-Flash and Pro.\n\nHCA runs MLA, but over a sequence compressed 128× along its length, so every above becomes . That divides the compute () and the cache read () by 128, while leaving the query term alone, since the query is per token, not per context token. The right mental model is to read the MLA table at an effective context length .\n\nWith the constants above, HCA’s first speculative token () crosses the bf16 ridge at an original context length of about tokens. HCA does leave a real speculation band at short-to-mid context lengths; around tokens, moderately wide verifies remain memory-bound. But “memory-bound” needs a caveat: the memory traffic is split between pulling in KV cache for previous entries (which gets amortised over , so encourages speculation), and pulling in new (larger) query vectors (which scales with , so penalises speculation). Speculation is useful between about k, and k.\n\nCSA does two different attention things. 
It first runs an ‘indexer’, which is a smaller MLA mechanism, scoring every (lightly, ~) compressed position with an index and keeping the top , then attends over only those.\n\nThe main attention part is plain MLA over that fixed -token sequence, so it sits exactly where the table above puts pure MLA at : just memory-bound at , compute-bound the instant you speculate. It’s capped, so it never grows with context.\n\nThe index is *itself* an MLA. (The index keeps its own small cache, one 128-wide key per token in fp8,\nscored against 64 query heads. The cancels exactly as does\nin the main attention, leaving the same , just at score-only\n(half) cost.) It dots a query from each of 64 heads\nagainst a single shared 128-wide key per position, the same\none-latent-feeds-many-heads shape, with the same fat-query asymmetry. It\ncomputes the score but not the value, so its intensity is half the main\nattention’s, roughly , enough to keep it memory-bound for a token or\ntwo and compute-bound by , nearly regardless of context.\n\nSo pretty much wherever you look in V4’s attention, the compressed dense layers, the sparse selected attention, or the index that feeds it, the verify tokens have a real cost. The compression that makes the KV cache cheap is the very thing that removes some of the slack speculation needs.\n\n### The marginal-cost curve\n\nThe second chart prices one speculated token against a token already in the batch. Each of the three terms of is a roofline: a pass over some number of tokens costs , the compute growing with the token count and the memory set by whatever bytes that pass has to move. A token already in the batch pays the average, ; the speculated token wedged into the same step pays the marginal, . The ratio of the two is the local slope of the roofline,\n\nwhich is the only thing the chart draws. It splits on which side of the ridge we sit. Compute-bound, grows linearly with the tokens, so and the speculated token pays full price. 
Memory-bound, is the elasticity of the byte traffic: for a load that doesn’t grow with the batch, somewhere between for one that does.\n\nis standard. It reuses a fixed set of weights, so the bytes it moves don’t grow with the batch at all: until the fp8 GEMM saturates, after. The step lands where the GEMM’s two FLOPs per weight byte, summed over the batch, reach the fp8 ridge of , at tokens.\n\nmoves the distinct experts the batch touches, the coupon count from above, so on its memory branch . A lone token pays for all of its own experts and shares nothing ( as ); once the batch is past the knee at the experts are already resident and the next token rides them for free (). The always-on shared expert sits in the denominator as one more resident load, , which pulls the lone-token value at down to . Far out, where finally meets the fp4 ridge (the right edge of the first chart, ), the GEMM turns compute-bound and climbs back to .\n\nis MLA. Its memory is the KV read, bytes per context token, paid once per step however many query tokens ride along; its compute is the of the intensity model, growing with the verify width . So while the read dominates and once the compute overtakes it, with the crossover at tokens. Decode () sits just under that line, memory-bound, the usual reason a verify token is cheap. 
But , the first speculated token, is already over it: , and it stays there at every larger batch and context.\n\nThe black line is these three blended by how much of the bill each currently owns,\n\nplus the flat drafter cost , the term from the speedup, paid on every drafted token whether or not it survives verification.\n\nLast modified: 12 Jun 2026", "url": "https://wpnews.pro/news/the-economics-of-speculative-decoding", "canonical_source": "https://fergusfinn.com/blog/economics-of-speculative-decoding/", "published_at": "2026-06-08 00:00:00+00:00", "updated_at": "2026-06-12 13:14:50.055614+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "ai-research", "ai-infrastructure"], "entities": ["Eagle 3.1", "DFlash", "SSD", "Llama", "Meta"], "alternates": {"html": "https://wpnews.pro/news/the-economics-of-speculative-decoding", "markdown": "https://wpnews.pro/news/the-economics-of-speculative-decoding.md", "text": "https://wpnews.pro/news/the-economics-of-speculative-decoding.txt", "jsonld": "https://wpnews.pro/news/the-economics-of-speculative-decoding.jsonld"}}