{"slug": "exploring-speculative-decoding-from-concept-to-implementation", "title": "Exploring Speculative Decoding: From Concept to Implementation", "summary": "Speculative decoding optimizes LLM inference by using a cheap draft model to predict multiple tokens, which are then verified in a single forward pass of the target model, reducing memory-bandwidth bottlenecks and improving GPU utilization. The technique leverages the KV cache and can yield significant efficiency gains when draft tokens are accurate.", "body_md": "# Exploring Speculative Decoding: From Concept to Implementation\n\nIn this post, we explore speculative decoding through a concrete vLLM-focused implementation, covering draft models, EAGLE, MTP, and the tradeoffs involved.\n\n## Intro\n\nIn this post, I’ll discuss speculative decoding, a technique used to optimize LLM inference. It’s one of those things that, when you first learn about it, somehow just clicks in an “oh yeah! that makes sense” way.\n\nBut first, I want to motivate why LLM inference optimization matters.\n\nLLMs generate responses to user queries. We train large models once, often at 8- or 9-figure costs, but serving them is what happens millions of times. Running a model requires some serious hardware, and saying GPUs are expensive and in limited supply is almost an understatement. A small efficiency gain means very large savings over time.\n\n## Refresher on LLM Inference Basics\n\nModern GPUs are impressive beasts. But they have their quirks. A GPU can run hundreds of trillions of operations per second, yet it can only move a few trillion bytes per second from GPU memory to the compute units.\n\nLLM inference is autoregressive. If we have some input tokens `[t_1 .. t_n]`, the model gives us a logits vector of size `vocab_size` for each token in the sequence. 
We use the logits of the last token `t_n` to sample the next token `t_{n+1}`.\n\nGiven that gap between compute and memory bandwidth, unless we run hundreds of operations per byte loaded, which single-token LLM decoding does not, we’re essentially memory-bandwidth bound.\n\nWhen we batch, we perform more ops per set of weights (X @ W): the larger the batch, the more we reuse the weights (W) we brought from memory, and the better we utilize the GPU.\n\n## From KV Cache to Speculative Decoding\n\nEach token goes through multiple layers, and each layer has a few standard blocks: normalization, MLP, and most notably the transformer’s attention block. In every block other than attention, each token is processed independently of the others. In attention, token `t_i` needs to know about tokens `t_1 .. t_{i-1}`. Specifically, it needs access to the keys and values for those tokens at that particular layer.\n\nOne key optimization that LLM inference engines bring to the table is that, instead of recalculating those K and V tensors for every new token, we store them and only run the calculation for the new tokens. That is the famous KV cache. As new tokens go through the model, their key and value vectors are added to the cache.\n\nThe sampled token `t_{n+1}` is added to the input, and its logits are used to sample `t_{n+2}` and so on. During the first model run, we calculate `n` logits vectors in the output even though we only need the last one for generation. And the cost of computing logits for a few extra tokens in the same forward pass is often much smaller than the cost of running several decode steps one after the other.\n\nFor each run or forward pass that produces a new token, we need to load all the model’s weights, and we need to reload them for the next one, and so on. And since the memory bandwidth is the limiting factor, we’re waiting for weights to be loaded most of the time.\n\nThat is where speculative decoding comes in. 
If we can guess a few likely tokens ahead of time and feed them to the model, we can verify them in one pass. If they are correct, we get those tokens almost for free.\n\nRegular:\n\n`[t_1 .. t_n] => t_{n+1}`\n\nSpec dec:\n\n`[t_1 .. t_n] => t_{n+1}, t_{n+2}, t_{n+3}, ...`\n\nOf course, this only works if the guessed tokens, which we call draft tokens, are usually correct and the guessing process is much cheaper than a full forward pass on the target model. If not, we might as well just use the large original model, which we refer to as the target.\n\nThere are many techniques to generate these draft tokens: n-gram, EAGLE, MTP, and others.\n\nBut the idea is the same. One forward pass is normally used to sample one token. If we can run a cheaper draft process and predict a few extra tokens, we can reduce the number of expensive target-model steps. If we do things right, we can also preserve the target model’s distribution exactly, as if there were no draft model at all.\n\n## Pseudo code\n\n```python\ndef propose(tokens, num_speculative_tokens):\n    draft_tokens = []\n\n    for _ in range(num_speculative_tokens):\n        logits = generate_draft_token(tokens)  # this must be very fast compared to the target model\n        token = sample(logits)\n        draft_tokens.append(token)\n        tokens.append(token)\n\n    return tokens, draft_tokens\n```\n\nOnce we have the draft tokens, we verify them:\n\n```python\ndef sample_verify(tokens, num_drafts):\n    logits = target_model(tokens)  # run target model forward, this is the true distribution\n\n    # logits[j] predicts tokens[j + 1], so each draft token is checked\n    # against the logits produced one position before it\n    for i, logit in enumerate(logits[-num_drafts - 1:-1]):\n        target_token = sample(logit)\n        draft_token = tokens[-num_drafts + i]\n\n        if target_token == draft_token:\n            accept(draft_token)\n        else:\n            reject(draft_token)\n            break\n```\n\n## Guaranteeing The Original Distribution\n\nThe important part 
is that speculative decoding doesn’t sacrifice correctness for speed. The final output follows the same distribution as the target model.\n\nThis is mathematically provable if we follow this algorithm:\n\n```\n# this step should be way faster than regular decode\n1. Sample tokens using the draft model via probability distribution q(x)\n# verify multiple tokens cheaply in one pass\n2. Target model computes true distribution p(x)\n3. Accept the sampled token with probability `min(1, p(x)/q(x))`. Two cases:\n   a. If p(x) >= q(x), we accept with probability 1. The target likes the token at least as much as the draft, so we keep it.\n   b. If p(x) < q(x), we accept with probability p(x)/q(x), which is less than 1.\n4. If rejected, sample a correction token from max(0, p(x)-q(x)), normalized so it sums to 1\n```\n\nStep 4 is key: it’s where we account for the discrepancy between the draft and target. By resampling from the normalized positive part of p-q, we cover the tokens the draft was underweighting.\n\nStep 3b is also important. If the draft thinks token 2 has probability 0.4 but the target thinks it’s 0.2, then p/q = 0.2/0.4 = 0.5, so we accept it half the time. This makes sense: the draft is overconfident, and we correct for that via the ratio p/q.\n\nIf p > q, the target likes the token more than the draft does, so we just accept it.\n\n## Speculators\n\nThere are many ways to get these draft tokens. The two criteria we are optimizing for are staying as close as possible to the target model and being much cheaper to run.\n\n### Smaller Draft Model\n\nThe most intuitive one is probably using a smaller model. Imagine you have a 470B model and you rely on a 7B model from the same family.\n\nFor complex tokens, the small model will not perform well, but for repetitive and easy stuff, it should be decent. For example:\n\n`Q: How can we solve special relativity? A: To solve special ...`\n\nThe small model should easily guess that we will repeat part of the question and provide appropriate draft tokens. 
The rest of the answer is harder, but if we get the easy parts right, we still come out ahead.\n\nWe can come up with different techniques to generate draft tokens. It is just a function that takes the existing sequence and tries to predict `K` draft tokens. In vLLM, this is implemented as a pluggable speculator.\n\nThe snippets below are trimmed from vLLM and keep only the skeleton and the key data flow.\n\nThe [shared proposer base](https://github.com/vllm-project/vllm/blob/6bdabbad5bce747865fd3a249658518a4269cc22/vllm/v1/spec_decode/llm_base_proposer.py#L55) looks like this:\n\n```python\nclass SpecDecodeBaseProposer:\n    def __init__(self, vllm_config, device, pass_hidden_states_to_model, runner=None):\n        ...  # body trimmed\n\n    @torch.inference_mode()\n    def propose(\n        self,\n        target_token_ids,\n        target_positions,\n        target_hidden_states,\n        next_token_ids,\n        token_indices_to_sample,\n        common_attn_metadata,\n        sampling_metadata,\n        mm_embed_inputs=None,\n        num_rejected_tokens_gpu=None,\n        slot_mappings=None,\n    ):\n        # Take in current token and extra data that might be used (depending on the proposer)\n        # then generate draft tokens and their draft logits/probs\n        ...\n```\n\n### N-gram\n\nThings tend to repeat themselves. What goes around comes around. This is a general principle in computing that underlies caching: if we see some data, we are likely to use it again. That is locality. Following this principle, we can look at the last `N` tokens in our sequence and search for a previous occurrence. Our draft tokens are then the `K` tokens that came after that previous occurrence.\n\nFrom the example above, the suffix of the sequence is “solve special”, its previous occurrence is a few words earlier, and what comes after it is “relativity?”. 
We guess that as the likely draft tokens, and we end up being right.\n\nN-gram is a very simple and cheap technique to run. We can even run it on the CPU. Its simplicity also means that it is often wrong in practice, but it can be quite useful for text with repetitive patterns, and code is the perfect example.\n\n```python\nclass NgramProposer:\n    def __init__(self, vllm_config):\n        # Draft length and match window come from speculative config.\n        self.min_n = vllm_config.speculative_config.prompt_lookup_min\n        self.max_n = vllm_config.speculative_config.prompt_lookup_max\n        self.k = vllm_config.speculative_config.num_speculative_tokens\n        self.max_model_len = vllm_config.model_config.max_model_len\n\n    def propose(self, sampled_token_ids, num_tokens_no_spec, token_ids_cpu, slot_mappings=None):\n        # Only speculate for requests that actually sampled a token.\n        valid_requests = [i for i, sampled_ids in enumerate(sampled_token_ids)\n                          if sampled_ids and num_tokens_no_spec[i] < self.max_model_len]\n        return self.batch_propose(len(sampled_token_ids), valid_requests, num_tokens_no_spec, token_ids_cpu)\n\n    def batch_propose(...):\n        for i in prange(len(valid_ngram_requests)):\n            idx = valid_ngram_requests[i]\n            num_tokens = num_tokens_no_spec[idx]\n            context_token_ids = token_ids_cpu[idx, :num_tokens]\n            drafter_output = _find_longest_matched_ngram_and_propose_tokens(\n                origin_tokens=context_token_ids,\n                min_ngram=min_n,\n                max_ngram=max_n,\n                max_model_len=max_model_len,\n                k=k,\n            )\n\n            valid_ngram_num_drafts[idx] = drafter_output.shape[0]\n            if len(drafter_output):\n                valid_ngram_draft[idx, : 
drafter_output.shape[0]] = drafter_output\n\ndef _find_longest_matched_ngram_and_propose_tokens(origin_tokens, min_ngram, max_ngram, max_model_len, k):\n    # use Knuth–Morris–Pratt (KMP) algorithm to match longest pattern\n    # this video explains it **neatly**: https://www.youtube.com/watch?v=JoF0Z7nVSrA\n    # if you're not familiar with it, it's worth a watch\n    ...  # body trimmed\n```\n\n### EAGLE\n\nAlthough the small draft model looks good in theory, in practice it is still lacking because these are fundamentally two different models that learn different things.\n\nAn interesting technique is EAGLE, which has several iterations: EAGLE1, EAGLE2, EAGLE3, and most recently EAGLE3.1. The key idea is that the target model is already doing most of the heavy lifting and has all the information needed to predict the next tokens. Some of that information lives in the intermediate hidden states.\n\nSo instead of relying only on the token embedding, EAGLE uses the embedding plus hidden states from the target model as input to a lightweight draft network. That draft network then predicts the next token.\n\nIn vLLM’s EAGLE-3 setup, the target model produces hidden states at selected layers. Those states are concatenated and projected through a fully connected layer, then passed through lightweight decoder layers and an LM head to produce draft logits. The draft model is still autoregressive, but it relies on the target model’s hidden states.\n\nEssentially, we add a new decode layer that takes as input all the tokens, including the last sampled one, along with their hidden states, then uses that information to predict the next draft token.\n\nFor the first draft token, we do not have a target-model hidden state for the new token yet, so the draft model has to rely on the hidden states it already has. 
That is one reason why the hidden state design matters.\n\nThe main difference between earlier EAGLE versions and EAGLE3 is that the earlier versions focused on the last hidden layer, while EAGLE3 uses multiple layers, usually spanning early, middle, and late stages, to capture a broader view of the model’s reasoning.\n\nThe diagram above from the EAGLE3 paper illustrates the idea. We had “How can” and we just predicted “I” using the target model. For each predicted token `i`, the hidden state that led to it comes from `i-1`. So “How”’s hidden state led to “can”, and “can”’s hidden state led to “I”. In the draft model, we combine the hidden state of `i-1` with the token embedding of `i`.\n\nThe hidden states from low, middle, and high layers are used. For each token, those hidden state vectors are concatenated and combined using a learned fully connected block that is trained to pick and combine the relevant information across the different stages.\n\n```\n# vllm/v1/worker/gpu/spec_decode/eagle/speculator.py:405-567\n\n@torch.inference_mode()\ndef propose(\n    self,\n    input_batch: InputBatch,\n    attn_metadata: dict[str, Any],\n    slot_mappings: dict[str, torch.Tensor],\n    last_hidden_states: torch.Tensor,     # [num_tokens, H] — target's final layer\n    aux_hidden_states: list[torch.Tensor] | None,  # EAGLE-3: 3 × [num_tokens, H]\n    num_sampled: torch.Tensor,            # [num_reqs] — accepted count from prev iter\n    num_rejected: torch.Tensor,           # [num_reqs] — rejected count from prev iter\n    last_sampled: torch.Tensor,           # [max_num_reqs] 
— last accepted token/request\n    next_prefill_tokens: torch.Tensor,    # [max_num_reqs] — for chunked prefills\n    temperature: torch.Tensor,\n    seeds: torch.Tensor,\n    ...\n) -> torch.Tensor:\n    num_tokens = input_batch.num_tokens_after_padding\n    num_reqs = input_batch.num_reqs\n    max_query_len = input_batch.num_scheduled_tokens.max()\n\n    \n    # STEP 1: FC FUSION (EAGLE-3 only)\n    \n    if aux_hidden_states:\n        assert self.method == \"eagle3\"\n        hidden_states = self.model.combine_hidden_states(\n            torch.cat(aux_hidden_states, dim=-1)\n        )\n    else:\n        # EAGLE-1/2: use final hidden states directly (no fusion)\n        hidden_states = last_hidden_states\n    \n    # STEP 2: PREPARE EAGLE INPUTS (Triton kernel)\n    prepare_eagle_inputs(\n        self.input_buffers, input_batch, self.last_token_indices,\n        num_sampled, num_rejected, last_sampled, next_prefill_tokens,\n        self.max_num_reqs,\n    )\n    \n    # STEP 3: PREFILL — GENERATE DRAFT TOKEN 0\n    self.prefill(\n        num_reqs, prefill_batch_desc.num_tokens,\n        attn_metadata, slot_mappings,\n        num_tokens_across_dp=num_tokens_across_dp,\n        cudagraph_runtime_mode=prefill_batch_desc.cg_mode,\n        mm_inputs=mm_inputs,\n    )\n\n    \n    # STEP 4: PREPARE DECODE — TRANSITION TO AUTOREGRESSIVE MODE\n    prepare_eagle_decode(\n        self.draft_tokens[:num_reqs, 0], input_batch.seq_lens,\n        num_rejected, self.input_buffers, self.max_model_len, self.max_num_reqs,\n    )\n\n    \n    # STEP 5: DECODE LOOP — GENERATE DRAFT TOKENS 1..K-1\n    self.generate_draft(\n        num_reqs, decode_batch_desc.num_tokens,\n        attn_metadata_updated, slot_mappings_updated,\n        num_tokens_across_dp=num_tokens_across_dp,\n        cudagraph_runtime_mode=decode_batch_desc.cg_mode,\n    )\n\n    return self.draft_tokens[:num_reqs]  # [num_reqs, K]\n\n## generating draft tokens is still auto-regressive hence the for loop\ndef 
generate_draft(self, num_reqs, num_tokens_padded, attn_metadata, slot_mappings, ...):\n    pos = self.input_buffers.positions[:num_reqs]\n    query_start_loc = self.input_buffers.query_start_loc[:num_reqs + 1]\n    idx_mapping = self.idx_mapping[:num_reqs]\n\n    # ── ITERATE THROUGH DRAFT POSITIONS 1, 2, ..., K-1 ──\n    for step in range(1, self.num_speculative_steps):\n        # EAGLE forward: 1 token per request (decode mode)\n        # Uses hidden_states from previous step + embed(prev_draft_token) as input\n        last_hidden_states, hidden_states = self.run_model(\n            num_tokens_padded, attn_metadata, slot_mappings, ...\n        )\n        last_hidden_states = last_hidden_states[:num_reqs]\n        hidden_states = hidden_states[:num_reqs]\n\n        # We have the final output of the EAGLE model\n        # We compute logits then sample the draft tokens\n        logits = self.model.compute_logits(last_hidden_states)\n        draft_tokens = self._sample_draft(logits, idx_mapping, pos, step=step)\n        self.draft_tokens[:num_reqs, step] = draft_tokens\n\n        # ── UPDATE STATE FOR NEXT STEP (unless this is the final step) ──\n        if step < self.num_speculative_steps - 1:\n            # ...\n            update_eagle_inputs(\n                draft_tokens, hidden_states,\n                self.input_buffers, self.hidden_states, self.max_model_len,\n            )\n            # ...\n```\n\nThere is one subtlety worth calling out because it is quite interesting. When generating draft tokens beyond the first one (tokens 2 through k), we use hidden states from the draft model itself. In the diagram above (steps 2 and 3 on the right), this corresponds to using `a_i` and `a_do` from the draft model rather than `g_i` and `g_do` from the target model, which contain the true hidden states.\n\nWhen a draft token is verified and accepted, the corresponding true hidden states from the target model are then passed back to the draft model. 
At that point, the draft “prefill” step (step 3 in the propose method) recomputes and repopulates the draft KV cache using this corrected information, because the “prefill” uses the same slot/attention metadata as the target’s.\n\nEAGLE models are trained separately and are their own models. We can find EAGLE3 models for a variety of open models, for example [here](https://huggingface.co/collections/RedHatAI/speculator-models). vLLM also has a project to train draft models, such as EAGLE, called [speculators](https://github.com/vllm-project/speculators), which integrates seamlessly with vLLM.\n\n### MTP\n\nEAGLE is a draft model that adds an extra decode path so we can efficiently predict extra draft tokens.\n\nCould we merge a similar extra layer into the target model itself and make it part of the model? That is what MTP is. Some models, such as DeepSeek-family models, include an extra multi-token prediction layer near the end of the network. When a new token is sampled, its embedding plus the hidden state of the last layer are passed to the MTP layer to predict the token that comes right after it. The same LM head and embedding are reused.\n\nThis is very similar to EAGLE, except that the model was trained with MTP from the start; it’s even part of the loss function. 
We can have more than one MTP layer to generate more draft tokens, or we can reuse the MTP layer to predict extra draft tokens, although accuracy will probably drop.\n\n```\nGiven context: \"The quick brown fox jumps over the\"\n\nTarget model forward pass:\n  h = model(\"The quick brown fox jumps over the\")\n  token_1 = sample(lm_head(h)) = \"lazy\"\n  Store: h (hidden state at position \"the\")\n\n# we predicted \"lazy\" using the hidden state at position \"the\"\n# we use both in MTP layer 0\nMTP Layer 0:\n  Input: embed(\"lazy\") ⊕ h\n  Output: h_mtp0\n  token_2 = argmax(lm_head(h_mtp0)) = \"dog\"\n\n# Same logic as above\nMTP Layer 1 (or Layer 0 reused):\n  Input: embed(\"dog\") ⊕ h_mtp0\n  Output: h_mtp1\n  token_3 = argmax(lm_head(h_mtp1)) = \"and\"\n\nDraft: [\"lazy\", \"dog\", \"and\"]\n```\n\nIf we venture to HuggingFace and look at [DeepSeek-v4 weights](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro?show_file_info=model.safetensors.index.json), we can observe the single MTP.0 layer sitting all by itself after the other 61 regular layers.\n\nLet’s look at the code. There’s nothing out of the ordinary. 
Take the input embeddings and the hidden states that led to them, normalize both to bring them to similar magnitudes, then concatenate and project them so they can be fed into a regular decoder layer.\n\nThe generated draft token is then verified by being run through all the model layers.\n\n```python\nclass DeepSeekMultiTokenPredictorLayer(nn.Module):\n    def __init__(self, vllm_config, prefix):\n        # MTP reuses the model's own embedding + hidden-state path.\n        config = vllm_config.speculative_config.draft_model_config.hf_config\n        self.enorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)\n        self.hnorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)\n        self.eh_proj = nn.Linear(config.hidden_size * 2, config.hidden_size, bias=False)\n        self.shared_head = SharedHead(config=config, prefix=prefix, quant_config=...)\n        self.mtp_block = DeepseekV2DecoderLayer(vllm_config, prefix, config=config, topk_indices_buffer=...)\n\n    def forward(self, input_ids, positions, previous_hidden_states, inputs_embeds=None, spec_step_index=0):\n        assert inputs_embeds is not None\n        # Position 0 is masked out because MTP only needs the shifted context.\n        inputs_embeds = torch.where(positions.unsqueeze(-1) == 0, 0, inputs_embeds)\n        inputs_embeds = self.enorm(inputs_embeds)\n        previous_hidden_states = self.hnorm(previous_hidden_states)\n        # Fuse the current embedding with the previous hidden state.\n        hidden_states = self.eh_proj(torch.cat([inputs_embeds, previous_hidden_states], dim=-1))\n        # One extra decoder block turns that fused state into the hidden state used for draft logits.\n        hidden_states, residual = self.mtp_block(positions=positions, hidden_states=hidden_states, residual=None)\n        return residual + hidden_states\n\nclass 
DeepSeekMultiTokenPredictor(nn.Module):\n    def __init__(self, vllm_config, prefix=\"\"):\n        config = vllm_config.model_config.hf_config\n        self.mtp_start_layer_idx = config.num_hidden_layers\n        self.num_mtp_layers = config.num_nextn_predict_layers\n        self.layers = nn.ModuleDict({...})\n        self.embed_tokens = VocabParallelEmbedding(...)\n        self.logits_processor = LogitsProcessor(config.vocab_size)\n\n    def forward(self, input_ids, positions, previous_hidden_states, inputs_embeds=None, spec_step_idx=0):\n        # cycle through the layers if the number of draft tokens exceeds the number of MTP layers\n        current_step_idx = spec_step_idx % self.num_mtp_layers\n        return self.layers[str(self.mtp_start_layer_idx + current_step_idx)](\n            input_ids, positions, previous_hidden_states, inputs_embeds, current_step_idx\n        )\n\n    def compute_logits(self, hidden_states, spec_step_idx=0):\n        mtp_layer = self.layers[str(self.mtp_start_layer_idx + (spec_step_idx % self.num_mtp_layers))]\n        # notice the \"shared_head\"\n        return self.logits_processor(mtp_layer.shared_head.head, mtp_layer.shared_head(hidden_states))\n```\n\n[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) by the author.", "url": "https://wpnews.pro/news/exploring-speculative-decoding-from-concept-to-implementation", "canonical_source": "https://cefboud.com/posts/speculative-decoding/", "published_at": "2026-05-31 00:00:00+00:00", "updated_at": "2026-06-27 16:06:47.253941+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-research"], "entities": ["vLLM", "EAGLE", "MTP", "GPU"], "alternates": {"html": "https://wpnews.pro/news/exploring-speculative-decoding-from-concept-to-implementation", "markdown": "https://wpnews.pro/news/exploring-speculative-decoding-from-concept-to-implementation.md", "text": "https://wpnews.pro/news/exploring-speculative-decoding-from-concept-to-implementation.txt", "jsonld": 
"https://wpnews.pro/news/exploring-speculative-decoding-from-concept-to-implementation.jsonld"}}