Speculative Decoding: 20-50% Faster LLM Inference

wpnews.pro

Faster LLM inference without quality loss - a practical guide

A 70B model generates one token per forward pass, and each pass reloads weights from VRAM, computes attention across the context, and synchronizes memory. Between tokens, the GPU sits idle while it waits for sequential dependencies to resolve.

On an H100, a 70B model produces one token every 30-50ms. The GPU has enough compute capacity to process multiple tokens in parallel, but the sequential dependency prevents it — each token depends on the previous one, and the pipeline stalls.

Speculative decoding breaks that bottleneck by letting you generate multiple tokens in the time it normally takes to generate one, without changing the output distribution. The tokens you get are statistically identical to what you’d get from standard autoregressive decoding; the only difference is how fast you get them.

This guide covers the mechanics, the variants available in 2026, acceptance rate tradeoffs, and practical setup across llama.cpp, vLLM, SGLang, and TensorRT-LLM.

How Autoregressive Decoding Works (and Why It’s Slow) #

Before you can understand speculative decoding, you need to understand the autoregressive constraint it works around. Standard autoregressive generation processes tokens sequentially:

Run a forward pass through the model with the current context.
Sample the next token from the output distribution.
Append the token to the context.
Repeat.

Each step requires a full forward pass — weights from VRAM, computing attention across the entire context, and producing a single token. For a model with 70B parameters, this takes roughly 30-50ms per token on an H100. The GPU has compute capacity to spare — it could process more work in parallel — but the sequential dependency prevents it.

The Compute-VRAM Gap

Modern GPUs have more FLOPs than they need for single-token generation, so the real bottleneck is memory bandwidth — weights must be streamed from VRAM to the compute units for each forward pass. When generating one token at a time, the GPU spends most of its time waiting for memory transfers rather than doing useful compute.

Speculative decoding addresses this by giving the GPU more work per memory transfer. Instead of one token per forward pass, it generates K tokens per forward pass, amortizing the memory cost across multiple outputs.

The Draft-Verify Mechanism #

Speculative decoding works in repeating draft-verify cycles. A fast draft mechanism proposes K candidate tokens — from a small draft model, an n-gram lookup, or a prediction head attached to the target model — and the target model verifies all K in a single forward pass. The draft phase is cheap, typically 5-20% of the target model’s forward pass time, while verification compares each drafted token against what the target would have generated, accepting the longest matching prefix and resampling from the first rejection onward.

Verifying K tokens costs roughly the same as generating one token autoregressively, so when the draft is correct you get K tokens for the price of one verification step.

A Concrete Example

Suppose the draft model proposes 5 tokens: ["I", " like", " cooking", " and", " traveling"]

. The target model verifies them in a single forward pass:

Token	Draft	Target agrees?
1	“I”	✓
2	" like"	✓
3	" cooking"	✗ (target would say " playing")
4	" and"	— (not evaluated)
5	" traveling"	— (not evaluated)

The target accepts tokens 1 and 2, then generates " playing" for token 3, producing three tokens in one cycle instead of three separate forward passes. If the draft had been correct through token 5, you’d get five tokens for the cost of one verification — a 5x speedup on that cycle alone.

The Verification Bottleneck

In practice, verification dominates execution time — 42-95% of the cycle, depending on the method and model size. The target model’s forward pass is the bottleneck, and rejected tokens represent wasted compute.

This is why acceptance rate matters so much. Every rejected token after the first is wasted verification work. The best speculative decoding methods maximize the expected accepted tokens per cycle, not just the raw acceptance rate.

The Mathematical Guarantee #

One of the most important properties of speculative decoding is that it produces tokens from the exact same distribution as standard autoregressive sampling from the target model. The verification step uses rejection sampling — when the draft proposes token x, the target model computes its own probability p(x) and the draft computes p_draft(x). The acceptance probability is:

min(1, p(x) / p_draft(x))

When the target agrees (p(x) ≥ p_draft(x)), the token is always accepted. When the target disagrees, the token is accepted with probability proportional to the ratio, and rejected tokens are resampled from a residual distribution:

r(x) = max(0, p(x) - p_draft(x)) / Σ max(0, p(y) - p_draft(y))

This procedure guarantees that the output sequence follows the target model’s distribution exactly, which is why speculative decoding is lossless. The draft model influences speed, not quality — the tokens you get are statistically indistinguishable from standard decoding, with the same perplexity and distribution. The only difference is latency.

Draft Model Strategies #

The draft mechanism is the variable that matters most. Different approaches have different tradeoffs between setup complexity, acceptance rate, and speedup.

Standalone Draft Models

The simplest approach loads a smaller model alongside the target — typically a 1B-3B model drafting for a 7B-70B target.

Pros:

Conceptually straightforward
Works with any target model
Draft model can be tuned to match target’s distribution

Cons:

Requires a second model into VRAM (1-4 GB depending on size)
Draft model quality directly determines acceptance rate
Cross-family drafts (e.g., Qwen drafting for Llama) typically perform poorly

Rule of thumb: Use models from the same family. Gemma 2 2B drafts well for Gemma 2 27B. Llama 3.2 1B drafts well for Llama 3.1 70B. Cross-family drafts tend to have low acceptance rates because the token distributions diverge.

Finding Compatible Draft Models

Not all small models work as draft models for a given target. The critical factor is distribution alignment — how closely the draft model’s output probabilities match the target’s.

Target Model	Recommended Draft	Family Match
Llama 3.1 70B	Llama 3.2 1B-3B	Same
Llama 3.1 8B	Llama 3.2 1B	Same
Qwen 3 27B	Qwen 3 0.6B-1.8B	Same
Gemma 2 27B	Gemma 2 2B	Same
Mixtral 8x7B	Phi-3 4B (trained on Mixtral data)	Cross (careful)

The golden rule: if the draft model’s acceptance rate drops below 50%, speculative decoding may actually slow you down. The overhead of running the draft model plus verification outweighs the benefit when most proposals are rejected.

EAGLE and EAGLE-3: Prediction Heads #

EAGLE (Efficient Architecture Guided Language Model Estimation) eliminates the need for a separate draft model. Instead, it attaches lightweight autoregressive prediction heads to the target model’s internal layers.

How EAGLE Works

EAGLE trains prediction heads that take hidden states from the target model’s intermediate layers and predict future tokens. During inference:

The target model runs a forward pass through its layers.
At each layer, the EAGLE head reads the hidden state and proposes tokens for future positions.
Multiple heads operate in parallel, each predicting a different future timestep.
The target model verifies all proposals in a single pass.

The advantage: EAGLE heads are trained specifically to match the target model’s distribution. They see the target’s internal representations directly, which gives them much better alignment than a standalone draft model.

EAGLE-3 Improvements

EAGLE-3 (2025) refines the approach with three key changes:

Layer selection: Instead of attaching heads to every layer, EAGLE-3 uses Bayesian optimization to select the optimal exit layer, reducing overhead.Multi-token prediction: Each head predicts multiple tokens simultaneously, increasing the draft depth without proportional compute cost.Training efficiency: EAGLE-3 trains on the target model’s own generation data, improving acceptance rates on in-distribution workloads.

Acceptance rates: EAGLE-3 typically achieves 60-80% acceptance rates on in-distribution workloads, compared to 40-60% for standalone draft models. On code generation workloads with high repetition, acceptance can exceed 85%.

Setup: EAGLE-3 requires pre-trained heads for your target model. NVIDIA provides EAGLE-3 heads for several popular models through TensorRT-LLM and the Speculative Decoding Modules collection on HuggingFace. Third-party implementations exist for vLLM and SGLang.

P-EAGLE: Parallel Drafting (March 2026)

EAGLE-3’s main limitation is autoregressive drafting — each draft token depends on the previous one, so generating K draft tokens requires K sequential forward passes through the draft head, and draft overhead grows linearly with K. P-EAGLE removes this ceiling by generating all K draft tokens in a single forward pass through a lightweight 4-layer drafter trained to predict up to 10 tokens in parallel.

The result: P-EAGLE delivers up to 1.69x speedup over vanilla EAGLE-3 on real workloads on NVIDIA B200. The advantage widens at higher K values — where EAGLE-3’s sequential drafting becomes a bottleneck, P-EAGLE’s parallel drafting incurs no additional cost.

Setup in vLLM: Download a pre-trained P-EAGLE head from HuggingFace, set "parallel_drafting": true

in your vLLM config, and use the same --speculative-model

flag — vLLM handles the rest. P-EAGLE is the current state-of-the-art for EAGLE-based speculative decoding, and if you’re deploying EAGLE in 2026, P-EAGLE is the variant to use.

n-gram Speculative Decoding #

n-gram speculative decoding replaces a neural draft with pattern matching against prompt history. The algorithm looks for repeated n-gram sequences in the context, and when the current token sequence matches a previously seen pattern, it proposes the tokens that followed that pattern earlier — for example, if the model has already generated def calculate_total(items):

and encounters def calculate_total(

again, it knows the next tokens will likely be items):

based on the previous occurrence.

The n-gram map variants (ngram-map-k

, ngram-map-k4v

) use hash tables for faster lookups instead of linear scanning, with the hash key as the current n-gram of size N and the value as the token sequence that followed.

Pros:

Zero VRAM overhead — no additional model to load (~16 MB for the hash table)
Extremely fast for repetitive workloads (code editing, refactoring, template generation)
Acceptance rates can reach 90%+ on workloads with high self-similarity

Cons:

Useless for novel generation — if the pattern hasn’t appeared before, n-gram has nothing to propose
Acceptance rate drops to near zero on creative or diverse workloads
Limited draft depth (typically 2-4 tokens per match)

Best for: Code refactoring, template filling, repetitive documentation, and any workload where the model revisits similar patterns. Worst for: creative writing, open-ended chat, and reasoning tasks.

Parameter Tuning

The n-gram parameters matter more than you’d expect. The defaults work for code, but text workloads need adjustment:

Parameter	Default	Code	Text
`size-n` (lookup length)
12	12-16	8-10	Longer n-grams reduce false positives but miss shorter patterns
`size-m` (draft length)
48	48	32	Longer drafts mean more tokens per match, but also more rejections
`min-hits`
1	1	2	Higher min-hits reduces false positives at the cost of fewer matches

For text workloads, reduce size-n

to 8-10 and increase min-hits

to 2. This trades off match frequency for higher acceptance rates per match.

Self-Speculative Decoding #

Self-speculative decoding (also called LayerSkip or self-speculation) uses the model’s own partial computation as the draft, so no separate model is needed.

How It Works

Instead of running the full model for each token, self-speculative decoding runs a truncated version — skipping some transformer layers — to generate draft tokens cheaply, and the full model then verifies the proposals.

For example, a 32-layer model might run with only 16 layers for drafting, then verify with all 32 layers. The truncated forward pass is faster because it processes fewer layers, and the draft tokens benefit from seeing the same initial layers as the target.

Pros:

No additional model weights to load
Naturally aligned with the target distribution (same architecture, partial layers)
Works well for models with significant redundancy in deeper layers

Cons:

Requires modifying the inference engine to support partial forward passes
KV cache complications — the draft uses partial KV cache that must be reconciled with the full model’s cache
Acceptance rates are typically lower than EAGLE or well-tuned draft models

llama.cpp implementation: PR #18471 introduced self-speculative decoding using context history as draft. The model reuses tokens from its own generation history to propose continuations, particularly effective for coding workloads where patterns repeat within the same context window.

MTP (Multi-Token Prediction) #

MTP is a specialized form of speculative decoding built directly into certain model checkpoints. Qwen 3.6 ships both standard and MTP-enabled GGUF variants.

How it differs: MTP heads are baked into the model architecture during training. The model carries extra prediction heads that propose multiple future tokens in a single forward pass. There’s no separate draft model — the MTP heads are part of the target model itself.

Tradeoffs:

No draft model to manage — MTP is activated with --spec-type draft-mtp --spec-draft-n-max N
MTP heads add ~1-2 GB VRAM overhead
Works best on MoE architectures (Qwen 3.6 35B-A3B) where sparse routing keeps MTP heads cheap

For detailed benchmarks on MTP vs standard decoding across Qwen 3.6 27B and 35B, see Qwen 3.6 MTP vs Standard on 16GB GPU.

Acceptance Rates: What They Mean in Practice #

Acceptance rate (α) is the single most important metric for speculative decoding performance. It determines whether you’re getting a speedup or paying overhead.

The Speedup Formula

Expected accepted tokens per verification pass:

E[accepted] = α × K

Where K is the number of draft tokens proposed per cycle. If α = 0.7 and K = 5, you accept 3.5 tokens per pass — a 3.5x speedup over standard decoding (which produces 1 token per pass).

Acceptance Rate by Method

Method	Typical α Range	Best Workload
Draft model (same family)	40-60%	General chat, reasoning
Draft model (cross-family)	20-40%	Rarely recommended
EAGLE-3	60-80%	General workloads, code
P-EAGLE	65-85%	General workloads, deeper speculation
n-gram	10-90%+	Workload-dependent (high on repetitive, near zero on novel)
MTP	50-70%	Qwen 3.6 models specifically
Self-speculative	30-50%	Coding, repetitive patterns

When Acceptance Rate Drops

Acceptance rate is not constant across a generation. It varies by:

Token position: Early tokens tend to have higher acceptance (more context, less uncertainty). Later tokens drop as the model explores more diverse continuations.Workload type: Code editing with repeated patterns sees α > 80%. Open-ended creative writing sees α < 40%.Temperature: Higher temperature increases divergence between draft and target, lowering acceptance. Speculative decoding works best at low temperature (0.0-0.7).

Critical threshold: If your effective acceptance rate (α × K) drops below 1.0, speculative decoding is slower than standard decoding. The draft overhead plus verification time exceeds the cost of a single autoregressive step.

Speculative Decoding in Production: What Actually Happens #

Research papers report 2-4x speedups, but production benchmarks tell a more nuanced story — speedups shrink with batch size, verification dominates cycle time, and no single method wins on every workload.

SpecDecode-Bench Findings (2026)

A systematic evaluation of five SD variants (n-gram, EAGLE, EAGLE-3, Draft-Model, MTP) on vLLM across four models and six workloads revealed:

SD works, but speedups shrink with batch size. At batch size 1, EAGLE achieves up to 1.96x on Llama-3-70B. By batch size 128, this drops to 1.21x. The system becomes compute-bound at high concurrency, and the GPU has less idle capacity to spare for speculation. - Verification dominates execution time (42-95%). The target model’s forward pass is the bottleneck. Reducing wasted verification on rejected tokens is the most promising avenue for improvement. - No single method wins everywhere. EAGLE-3 is the best all-around choice. Draft-model methods excel when the target model is large (70B+). n-gram is optimal for code editing and high-overlap tasks. - Oracle analysis reveals a gap. The theoretical upper bound for combined n-gram + EAGLE strategies reaches ~4.9x on code editing workloads, but current implementations achieve 2-3x. There’s room for optimization.

Practical Speedup Expectations

Scenario	Expected Speedup
70B model, single request, EAGLE-3	1.5-2.0x
70B model, batch 32, EAGLE-3	1.2-1.5x
8B model, single request, draft model	1.3-1.8x
Code editing, n-gram	2.0-4.0x (workload dependent)
Creative writing, any method	1.0-1.3x (often not worth it)
MTP on Qwen 3.6 27B, 16GB GPU	1.5-1.7x
P-EAGLE on B200, single request	2.0-3.0x

The batch size effect is critical. At small batches, the GPU has idle compute to spare for speculation. At large batches, the system is already saturated, and speculative decoding adds overhead without proportional benefit.

Monitoring in Production

You should track acceptance rate in production. A declining acceptance rate signals that your draft model is diverging from the target — either because the workload changed, or because the draft model needs retraining.

Key metrics to monitor:

Acceptance rate per request (should be stable around your baseline)
Tokens per second with vs without speculative decoding (the actual speedup)
Verification time as percentage of cycle time (should be 42-95%)
Draft model forward pass time (should be < 20% of target model time)

If your acceptance rate drops below 40%, disable speculative decoding for that request. The overhead is not worth it.

Practical Setup #

Engine choice matters as much as draft strategy — see Ollama vs vLLM vs LM Studio and other local runtimes for how each runtime handles batching, API compatibility, and throughput before you pick a speculative decoding path.

llama.cpp

For general server setup and GGUF , start with the llama.cpp quickstart; the flags below add speculative decoding on top.

llama.cpp supports multiple speculative decoding methods through the --spec-type

flag:

llama-server \
  --model target-model.gguf \
  --draft-model draft-model.gguf \
  --spec-draft-n-max 4 \
  --parallel 1  # Mandatory: --parallel 1 for speculative decoding

llama-server \
  --model target-model.gguf \
  --spec-type ngram-simple \
  --spec-ngram-simple-size-n 12 \
  --spec-ngram-simple-size-m 48

llama-server \
  --model target-model.gguf \
  --spec-type ngram-simple \
  --spec-ngram-simple-size-n 8 \
  --spec-ngram-simple-size-m 32 \
  --spec-ngram-simple-min-hits 2

llama-server \
  --model Qwen3.6-27B-MTP.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 2

llama-server \
  --model target-model.gguf \
  --spec-type draft-self

Critical flags:

--parallel 1

— Speculative decoding in llama.cpp requires single-batch mode. This is a current limitation.--spec-draft-n-max

— Number of draft tokens per cycle. Start with 3-5; higher values increase VRAM pressure.--spec-ngram-simple-size-n

— Lookup n-gram length. Default 12 works well for code; reduce to 8 for text.

Common pitfalls:

Forgetting --parallel 1

— the server will silently ignore speculative decoding. - Using cross-family draft models — acceptance rates collapse, negating any speedup.

Setting --spec-draft-n-max

too high — each extra draft token costs VRAM for the draft buffer. Diminishing returns kick in around 5-8.

vLLM

The vLLM quickstart covers base deployment; the flags below enable speculative decoding on an existing vLLM server.

vLLM supports speculative decoding through the --speculative-model

and --speculative-num-steps

flags:

vllm serve target-model \
  --speculative-model draft-model \
  --speculative-num-steps 5 \
  --speculative-accept-length 5

vllm serve target-model \
  --speculative-model EAGLE-target-model/ \
  --speculative-num-steps 7 \
  --speculative-draft-tensor-parallel-size 1

vllm serve target-model \
  --speculative-model P-EAGLE-target-model/ \
  --speculative-num-steps 7 \
  --speculative-parallel-drafting true

vllm serve target-model \
  --speculative-method ngram \
  --speculative-num-steps 5 \
  --ngram-context-size 12

vLLM’s speculative decoding is integrated with continuous batching, so it works under concurrent workloads. The scheduler handles multiple token slots within a single forward pass, and the memory manager handles KV cache for both draft and target models.

SGLang

SGLang supports speculative decoding through its --speculative-algorithm

flag:

python -m sglang.launch_server \
  --model-path target-model \
  --speculative-algorithm ngram \
  --ngram-context-size 12 \
  --ngram-max-candidate-tokens 6

SGLang’s RadixAttention architecture pairs well with speculative decoding because prefix caching reduces verification cost — the target model reuses cached attention for shared prefixes, making each verification pass cheaper than a cold forward pass.

TensorRT-LLM

TensorRT-LLM provides production-grade speculative decoding with Triton Inference Server. The setup is more involved but offers the best performance on NVIDIA hardware:

Build the TensorRT engine for both target and draft models.
Configure the model repository with model.yaml

specifying the speculative decoding configuration. - Launch Triton with the LLM API / PyTorch backend.

TensorRT-LLM supports both draft-model and EAGLE-3 variants. For code generation workloads, TensorRT-LLM with n-gram speculative decoding has demonstrated 2-3x latency reduction in production deployments.

When to Use Speculative Decoding #

Use It When

Large target models (7B+): The overhead of the draft mechanism is amortized across the target’s compute. Speculative decoding shines when the target model is slow — the larger the target, the more valuable the speedup.Low-temperature workloads: Speculative decoding works best at temperature 0.0-0.7, where the target model’s distribution is concentrated and the draft has a better chance of matching.Interactive applications: Latency-sensitive workloads (chat, code completion, agent tool calls) benefit most. Batch processing where you’re already saturating the GPU sees less benefit.Code generation and editing: High repetition in code patterns makes n-gram and self-speculative decoding particularly effective.

Skip It When

Small target models (< 3B): The draft model’s overhead approaches the target’s forward pass time. The speedup is marginal or negative.High-temperature sampling: At temperature > 0.7, the target model’s distribution is too broad for the draft to match reliably.Creative writing and open-ended generation: Low acceptance rates on novel content make the overhead not worth it.High batch sizes (> 32): The system becomes compute-bound, and speculative decoding adds overhead without proportional benefit. SpecDecode-Bench shows speedup dropping from 1.96x to 1.21x as batch size goes from 1 to 128.

Combining Methods #

Advanced setups combine multiple speculative decoding strategies. The SpecDecode-Bench oracle analysis showed that adaptively combining n-gram and EAGLE can push speedup to 4.9x on code editing workloads.

The idea is to use n-gram for patterns that have appeared before, where acceptance is high and overhead is near zero, and fall back to EAGLE for novel tokens. In practice this requires engine support for multi-method speculation — vLLM and TensorRT-LLM have experimental support, but production-grade implementations are still maturing.

For now, the most practical combination is MTP + n-gram in llama.cpp. MTP handles the neural speculation, while n-gram catches repetitive patterns that MTP misses. On Qwen 3 27B, this combination achieves 120 tokens/sec compared to 67 tokens/sec standard — a 1.8x speedup.

Cost Considerations #

Speculative decoding trades compute for latency. The total compute per token is roughly the same — you’re just doing more work in parallel rather than sequentially.

GPU cost impact:

Single-request latency improves by 20-50%, which matters for interactive applications.
Throughput (tokens/sec across many requests) improves less — the GPU is already saturated at high batch sizes.
VRAM usage increases by the draft model’s footprint (1-4 GB for standalone drafts, minimal for n-gram/EAGLE).

Cloud inference: At $2-4/hr per H100, speculative decoding reduces per-request latency without increasing per-token cost. For batch processing where you’re already saturating the GPU, the cost benefit is minimal — you’re paying for the same GPU time either way.

When speculative decoding saves money: Interactive applications where you charge per request and want to reduce time-to-first-token. A 2x speedup means your users wait half as long, and you can serve more requests per second on the same hardware.

When it doesn’t: Batch processing where you’re already maximizing GPU utilization. The extra compute from speculative decoding doesn’t increase throughput — it just changes the latency profile.

What’s Next #

Speculative decoding is maturing from research novelty to production standard. The frontier is pushing beyond the current limitations:

Speculative Speculative Decoding (SSD): Parallelizes the drafting and verification stages across separate hardware. The draft model runs asynchronously, pre-speculating for multiple likely verification outcomes. Early results show up to 2x speedup over optimized speculative decoding, and 5x over autoregressive decoding. Not production-ready yet, but the direction is clear. - SpecSA (Sparse Speculative Verification): Combines speculative decoding with dynamic sparse attention. Turns sparse attention into a verification-oriented workload, achieving up to 3.49x end-to-end throughput over autoregressive sparse decoding. Relevant for long-context models where sparse attention is already in use. - Adaptive speculation: Automatically switching between n-gram, EAGLE, and draft-model methods based on workload characteristics. The oracle analysis shows significant untapped potential — current implementations achieve 2-3x, but the theoretical bound is 4.9x. - Multimodal speculative decoding: Extending draft-verify to vision-language models and video generation. Early surveys show the same principles apply, but verification strategies need adaptation for non-text modalities.

Decision Framework #

Question	Answer	Recommendation
Target model size?	< 3B	Skip speculative decoding
Target model size?	7-13B	Use n-gram or self-speculative (low overhead)
Target model size?	30B+	Use draft model or EAGLE-3 (larger target = more benefit)
Workload type?	Code editing/refactoring	n-gram + EAGLE combination
Workload type?	General chat	EAGLE-3 or P-EAGLE
Workload type?	Creative writing	Skip speculative decoding
Batch size?	1-4 (interactive)	Speculative decoding helps most
Batch size?	32+ (throughput)	Speculative decoding helps less
Temperature?	0.0-0.7	Good for speculative decoding
Temperature?	> 0.7	Skip speculative decoding
Hardware?	16GB GPU	Use n-gram or MTP (low VRAM overhead)
Hardware?	24GB+ GPU	Draft model or EAGLE-3 feasible
Engine?	vLLM	EAGLE-3 or P-EAGLE (best integration)
Engine?	llama.cpp	n-gram or MTP (simplest setup)
Engine?	TensorRT-LLM	EAGLE-3 or draft model (production-grade)

source & further reading

glukhov.org — original article Multi-Agent Orchestration Patterns: A Practical Guide What Is Spec-Driven Development? The Spec as Source of Truth Spec-Driven Development vs Vibe Coding: Waterfall?

Speculative Decoding: 20-50% Faster LLM Inference

How Autoregressive Decoding Works (and Why It’s Slow) #

The Compute-VRAM Gap

The Draft-Verify Mechanism #

A Concrete Example

The Verification Bottleneck

The Mathematical Guarantee #

Draft Model Strategies #

Standalone Draft Models

Finding Compatible Draft Models

EAGLE and EAGLE-3: Prediction Heads #

How EAGLE Works

EAGLE-3 Improvements

P-EAGLE: Parallel Drafting (March 2026)

n-gram Speculative Decoding #

Parameter Tuning

Self-Speculative Decoding #

How It Works

MTP (Multi-Token Prediction) #

Acceptance Rates: What They Mean in Practice #

The Speedup Formula

Acceptance Rate by Method

When Acceptance Rate Drops

Speculative Decoding in Production: What Actually Happens #

SpecDecode-Bench Findings (2026)

Practical Speedup Expectations

Monitoring in Production

Practical Setup #

llama.cpp

vLLM

SGLang

TensorRT-LLM

When to Use Speculative Decoding #

Use It When

Skip It When

Combining Methods #

Cost Considerations #

What’s Next #

Decision Framework #

Run your AI side-project on zahid.host