I Benchmarked Speculative Decoding — a = 3.5 Wasn't Enough

A developer benchmarked speculative decoding using Qwen2.5-0.5B-Instruct as the draft model and Qwen2.5-1.5B-Instruct as the target model on a CPU. Across code, JSON, and story generation tasks, speculative decoding was 49-62% slower than raw autoregressive generation, despite acceptance lengths exceeding the theoretical threshold. The zero-accept rate ranged from 15.8% to 30.2%, indicating that many draft rounds produced no accepted tokens, adding overhead without benefit.

In my [last post](https://dev.to/zxpmail/lossless-but-not-free-the-lossless-but-not-free-when-speculative-decoding-actually-pays-off-1c2g), I laid out the core inequality of Speculative Decoding (SD):

a > 1 + α + β

Acceptance length a must exceed 1 plus the draft/target compute ratio α plus verification overhead β. If it does, SD wins. If it doesn't, SD loses.

That was theory. This post is the practice. I ran a real A/B test on my machine. The results were worse than I expected, and more interesting.

Hardware: 12th Gen Intel, 64GB RAM (CPU only). Yes, this means SD was always going to lose on raw speed. That wasn't the point. The point was measuring the acceptance length a across different task types. The speed numbers are secondary and CPU-specific; the a values are what transfer to GPU deployments.

Model pair: Qwen2.5-0.5B-Instruct (draft) → Qwen2.5-1.5B-Instruct (target). Same model family, same tokenizer: a "well-matched" pair by any measure.

Tasks (5 prompts each, 32 tokens per generation, greedy decoding): code, JSON, and story generation.

Draft length: k = 5 (the default sweet spot).

I logged every round: draft length k, accepted count a, and wall time for both raw autoregressive generation and speculative decoding.

Sample size note: 5 prompts × 32 tokens = ~160 generated tokens per task type. That is enough for directional signals and the qualitative patterns below, but not enough for release-grade latency benchmarks. The a values converged within 3-4 prompts; the speed numbers are CPU-specific and should not be taken as absolute.

| Task | Mean a | Median a | p10 a | Zero-accept rounds | Raw t/s | SD t/s | Speedup |
|---|---|---|---|---|---|---|---|
| code | 3.00 | 4.0 | 0.0 | 23.8% | 1.9 | 0.8 | -56% |
| json | 3.50 | 5.0 | 0.0 | 15.8% | 1.8 | 0.9 | -49% |
| story | 2.11 | 2.0 | 0.0 | 30.2% | 2.2 | 0.8 | -62% |

Speculative Decoding was 49-62% slower across all three task types. The acceptance lengths were well above the 1 + α + β threshold. But SD still lost, and it wasn't close.
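A minimal sketch of how a per-round log can be reduced to the columns of the results table. It assumes each round is stored as a dict with keys `k` and `a`; the function names are hypothetical, and the nearest-rank percentile is my assumption, since the post does not say which percentile method was used.

```python
import math
from statistics import mean, median

def summarize_rounds(rounds):
    """Reduce per-round logs ({"k": int, "a": int}) to summary statistics."""
    a_values = sorted(r["a"] for r in rounds)
    n = len(a_values)
    # Nearest-rank 10th percentile (an assumed convention, not the post's)
    p10 = a_values[max(0, math.ceil(0.10 * n) - 1)]
    return {
        "mean_a": mean(a_values),
        "median_a": median(a_values),
        "p10_a": p10,
        "zero_accept_rate": sum(1 for a in a_values if a == 0) / n,
    }

def relative_speedup(raw_tps, sd_tps):
    """Relative change in tokens/second; negative means SD is slower."""
    return sd_tps / raw_tps - 1.0

# Hypothetical log, for illustration only (not data from the benchmark)
example_rounds = [{"k": 5, "a": a} for a in [0, 0, 5, 5, 3]]
summary = summarize_rounds(example_rounds)
print(summary)
print(f"{relative_speedup(2.0, 1.0):+.0%}")
```

Keeping the zero-accept rate and p10 alongside the mean matters here because, as the table shows, the mean alone hides how many rounds return nothing.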
The acceptance length varied significantly by task: JSON had the highest mean a (3.50), code came next (3.00), and story generation was lowest (2.11).

This confirms the distribution shift argument from the theory post. The same model pair (0.5B → 1.5B, same family, same tokenizer) produces very different acceptance rates depending on what you're generating. Your SD speedup may vary more by task than by model size.

The most eye-opening metric wasn't the mean a. It was the zero-accept rate. Between 15.8% and 30.2% of draft rounds accepted exactly zero tokens. The draft model fired, generated 5 candidate tokens, and every single one was rejected. Those rounds are pure overhead: you paid for the draft run, paid for the verification run, and got nothing but a single token from the target model. In a round where a = 0, SD cost about 2x for the same output.

And because draft models aren't free (even a 0.5B model has real compute cost), these zero-accept rounds are what drag down the average.

The histogram of per-round a values tells the story:

```
code: 0 0 0 0 0 0 0 0 0 0 1 1 2 2 3 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
      ^^^^^^^^^^^^^^^^^^^             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      23.8% wasted                    42.9% full hits
```

The mode is 5 (full hit) and the second mode is 0 (full miss). The mean of 3.0 is somewhere in the middle, but the user experience is either "fast" or "very slow", not 3.0.

For latency-sensitive applications, the p10 or p25 of a matters more than the mean. If 25% of your requests hit zero-accept rounds, your p99 tail will be significantly worse than raw autoregressive generation.

This is where the CPU "bug" became a feature. My A/B test ran on CPU (12th Gen Intel, 64GB RAM). The 1.5B target model managed about 2 tokens/second. The 0.5B draft model managed about 6 tokens/second.
That gives α (compute ratio on CPU) ≈ 2/6 ≈ 0.3. Compare this to a GPU, where α (7B→70B) ≈ 0.05–0.1.

The inequality threshold shifts dramatically:

| Platform | α | β | Threshold (1 + α + β) | Our a |
|---|---|---|---|---|
| CPU | ~0.3 | ~0.10 | 1.40 | 2.1–3.5 |
| GPU (A100) | ~0.05 | ~0.02 | 1.07 | 2.1–3.5 |

On GPU, our a values (2.1–3.5) clear the threshold comfortably. On CPU, the margin is thinner, and empirically, SD still lost.

But the deeper insight is: SD is a GPU-bound optimization. The entire premise relies on the draft model being nearly free relative to the target. When the cost ratio α exceeds ~0.15, the headroom evaporates. And on CPU, with memory bandwidth as the bottleneck rather than compute, even a 3x smaller model doesn't come close to being "free."

If you're running SD on CPU... don't. The numbers don't work.

Let's plug the measured values into the inequality for the GPU scenario (where SD is designed to run):

Code (a = 3.0): 3.0 > 1.07 ✅ Clear win. Draft tokens are accepted about 3× faster than the overhead burns them.

JSON (a = 3.5): 3.5 > 1.07 ✅ Clear win. The draft model matched the target nearly perfectly on structured output.

Story (a = 2.1): 2.1 > 1.07 ✅ Marginal win. It clears the threshold, but with more zero-accept rounds eating into gains.

The inequality is a useful first-order guide. It predicts that SD wins on GPU and has far less headroom on CPU. It predicts that story generation is riskier than code generation.

But it doesn't capture everything. The zero-accept rate is a separate dimension, one that affects p99 latency more than throughput. If I were writing the inequality again, I'd add a variance term.

You don't need a complex benchmark framework.
Here's the core measurement loop:

```python
import torch

@torch.no_grad()
def measure_acceptance(model, draft, tokenizer, prompt, k=5, max_tokens=128):
    """Log a and k for each speculative generation round."""
    inputs = tokenizer(prompt, return_tensors="pt")
    generated = inputs["input_ids"]
    prompt_len = generated.shape[1]
    rounds = []  # each element: {"k": int, "a": int}

    while generated.shape[1] < prompt_len + max_tokens:
        prefix_len = generated.shape[1]

        # Draft: generate k candidate tokens
        draft_out = draft.generate(generated, max_new_tokens=k, do_sample=False)
        candidates = draft_out[0, prefix_len:].tolist()
        actual_k = len(candidates)
        if actual_k == 0:
            break

        # Verify: check each candidate against the target's greedy choice
        verify_input = torch.cat([generated, torch.tensor([candidates])], dim=-1)
        logits = model(verify_input).logits[0]
        accepted = 0
        for i, tok in enumerate(candidates):
            # logits at position p predict the token at position p + 1
            target_tok = logits[prefix_len - 1 + i].argmax().item()
            if tok == target_tok:
                accepted += 1
            else:
                break
        rounds.append({"k": actual_k, "a": accepted})

        # Accept the verified tokens
        if accepted > 0:
            generated = torch.cat(
                [generated, torch.tensor([candidates[:accepted]])], dim=-1
            )

        # Generate the next token from the target
        generated = model.generate(generated, max_new_tokens=1, do_sample=False)

        # Stop once the target emits end-of-sequence
        if generated[0, -1].item() == tokenizer.eos_token_id:
            break

    return rounds
```

Run it:

```python
data = measure_acceptance(target, draft, tokenizer, "Write a function...")
a_values = [r["a"] for r in data]
print(f"Mean a: {sum(a_values) / len(a_values):.2f}")
print(f"Zero-accept: {sum(1 for a in a_values if a == 0) / len(a_values) * 100:.1f}%")
```

The custom loop above gives you per-round logging. If you're using HuggingFace `generate`, the built-in `assistant_model` parameter offers the same acceleration with less code, but it doesn't expose per-round a values out of the box. Use the custom loop for measurement, switch to the built-in for production.

Run this on 100+ samples per task type and split by task category. Don't average across all traffic: your code completions and your chat responses will have drastically different a values.

1. Measure your own a before trusting vendor benchmarks.
Our model pair achieved anything from 2.1 to 3.5 depending on the task. If someone claims "85% speedup," ask: on what task, with what model pair, and what was the acceptance rate?

2. Don't average across tasks. A single mean a for your whole workload hides the story. Split by traffic type. If code routes have a = 3.5 and chat routes have a = 1.8, enable SD for code routes only.

3. Watch the zero-accept rate, not just the mean. A high zero-accept rate means worse p99 latency. In a system that must respond in 2 seconds, a 15% chance of being 30% slower is unacceptable.

4. SD is a GPU optimization. It works when α is tiny (the draft model is nearly free relative to the target). On CPU, or on any platform where the draft model competes for memory bandwidth with the target, the inequality collapses. Benchmark on your target hardware.

The theory says SD is lossless but not free. The practice confirms it, and adds nuance.

Lossless: yes. The output distribution is identical. No hallucinations were amplified.

Not free: more expensive than the simple 1 + α + β model suggests. The zero-accept variance, the task-dependent a, and the hardware-dependent α all eat into the theoretical speedup.

The inequality is still the right starting point. But a point estimate of a isn't enough: you also need the distribution.

The best optimization technique isn't the one that always wins. It's the one you know when to turn off. Now you know how to measure when that is.

Tested with Qwen2.5 models, transformers 5.12, torch 2.12 (CPU), Python 3.14 on Windows 10 (64GB RAM, 12th Gen Intel).
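Takeaways 2 and 3 can be combined into a per-route check. The following is a rough sketch, not a production policy: `TaskStats`, `should_enable_sd` and the `max_zero_rate` cutoff of 0.25 are hypothetical names and values chosen for illustration. The α and β values are the GPU estimates from the threshold table, and the per-task numbers come from the results table.

```python
from dataclasses import dataclass

@dataclass
class TaskStats:
    mean_a: float
    zero_accept_rate: float  # fraction of rounds with a == 0

def should_enable_sd(stats, alpha, beta, max_zero_rate=0.25):
    """Enable SD only if mean a clears 1 + alpha + beta AND the
    zero-accept rate stays under a cap (the cap is an assumed knob)."""
    threshold = 1.0 + alpha + beta
    return stats.mean_a > threshold and stats.zero_accept_rate <= max_zero_rate

# Measured values from the results table
measured = {
    "code": TaskStats(mean_a=3.00, zero_accept_rate=0.238),
    "json": TaskStats(mean_a=3.50, zero_accept_rate=0.158),
    "story": TaskStats(mean_a=2.11, zero_accept_rate=0.302),
}

# GPU estimates from the threshold table: alpha ~0.05, beta ~0.02
routes = {
    task: should_enable_sd(stats, alpha=0.05, beta=0.02)
    for task, stats in measured.items()
}
print(routes)
```

With this assumed cutoff, story generation is switched off even though its mean a clears the threshold, which is the point of tracking the zero-accept rate separately. The right cutoff depends on your own latency budget.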