{"slug": "speculative-decoding-shifted-our-output-distribution-and-evals-missed-it", "title": "Speculative decoding shifted our output distribution and evals missed it", "summary": "Nexus Labs enabled speculative decoding in vLLM for a fine-tuned 8B model, achieving a 1.9x throughput gain, but discovered that greedy decoding with a draft model is not bit-identical to greedy decoding without one due to float16 arithmetic differences. This caused a 1.2% output divergence in tool-call arguments, which offline evals missed because they tested a different serving path. The team fixed the issue by routing all eval traffic through the same serving endpoint and adding CI assertions to detect decode path drift.", "body_md": "**TL;DR: We turned on speculative decoding in vLLM to cut latency on a fine-tuned 8B. Got a 1.9x throughput win. Three weeks later a customer flagged that the agent's tool-call arguments had subtly changed. Greedy decoding with a draft model is not bit-identical to greedy decoding without one, and our offline evals never caught the drift because they ran on a different serving path.**\n\nI lead the eval team at Nexus Labs. We do enterprise agent automation, Series B, about 14 people in engineering. The model we fine-tune is a Llama-3.1-8B variant that drives tool calls. Latency matters because each agent turn can chain 4 or 5 calls.\n\nSo we enabled speculative decoding. Draft model was a distilled 1B. Target was our 8B. The pitch is simple: the draft proposes tokens, the target verifies them in one forward pass, you accept the longest matching prefix. When acceptance is high you get tokens nearly for free.\n\nThe throughput number was real. 1.9x at our batch sizes. The problem was everything we assumed about correctness.\n\nThe vLLM docs say speculative decoding is lossless for greedy. That is true in exact arithmetic. It is not true in float16 on a GPU.\n\nHere is the thing nobody tells you. 
The verification step recomputes logits for the drafted tokens in a batched forward pass. The target model alone computes them token-by-token. Different batch shapes, different kernel paths, different reduction order. The argmax usually agrees. Usually.\n\nWhen the top two logits are within a few thousandths of each other, the batched path and the sequential path can pick different tokens. For most text that is invisible. For structured tool-call output where one token flips `\"limit\": 50` to `\"limit\": 500`, it is not invisible at all.\n\nWe measured it. Ran the same 2,000 prompts through both paths, greedy, temperature 0.\n\n| Serving path | Exact-match outputs | Tool-arg mismatch | Tokens/sec |\n|---|---|---|---|\n| Target only (no spec) | baseline | 0% | 41 |\n| Spec decode, 1B draft | 98.8% | 1.2% | 78 |\n| Spec decode, 3B draft | 99.4% | 0.6% | 64 |\n\n1.2% of outputs differed. On agent traffic that chains calls, a 1.2% per-call divergence compounds. Over a 5-call session, assuming calls diverge independently, that's 1 - 0.988^5, or roughly a 6% chance at least one call drifts.\n\nThis is the part I'm actually annoyed about. Our offline eval suite hit the model directly through the HF `generate()` API. No speculative decoding. No batched verification. Our production serving stack ran vLLM with spec decode on.\n\nWe were evaluating one numerical path and shipping another. The eval harness was honest about the model it tested. It just wasn't testing the model we served.\n\nThe fix was boring and correct: evaluate against the exact serving endpoint. We route all eval traffic through the same gateway the app uses, so the eval client and the production client are indistinguishable to the backend. We use Bifrost in front of our vLLM and external providers, which gave us one OpenAI-compatible endpoint to point both at. The point isn't the tool. 
The point is your eval requests must traverse the identical decode path, kernels included.\n\nHere's the config flag that matters in vLLM:\n\n```\n# vllm serving config\nmodel: /models/nexus-8b-toolcall\nspeculative_config:\n  model: /models/nexus-1b-draft\n  num_speculative_tokens: 5\n# this is the one we missed:\n# disable_logprobs_during_spec_decoding: the key name and default vary by vLLM version,\n# so check the docs for the version you run.\n# pin it and assert it in CI.\nspeculative_disable_logprobs: false\n```\n\nAnd the eval-side assertion we added so this never ships silently again:\n\n```\n# fail CI if eval path != serving path\n# client is an OpenAI-compatible client pointed at the production gateway\nresp = client.chat.completions.create(\n    model=\"nexus-8b-toolcall\",\n    messages=msgs,\n    temperature=0,\n    extra_body={\"spec_decode\": True},  # must match prod\n)\nassert resp.system_fingerprint == EXPECTED_FINGERPRINT, (\n    f\"decode path drift: {resp.system_fingerprint}\"\n)\n```\n\nWe compute a fingerprint from the serving config (draft model hash, num_speculative_tokens, kernel version) and assert it. If someone bumps vLLM or swaps the draft, CI goes red before the eval numbers are trusted.\n\nWe kept speculative decoding. The latency win was worth more than 1.2% drift for most of our endpoints. But we did three things.\n\nFirst, we raised the bar on tool-call endpoints specifically. For the two customers running financial workflows, we run target-only, no draft. Slower, exact. They opted in to the cost.\n\nSecond, we started running a nightly divergence canary that replays 500 prompts through both serving paths and alerts if mismatch exceeds 1.5%. This caught a vLLM upgrade that shifted draft acceptance logic and pushed mismatch to 2.1%.\n\nThird, all eval traffic now routes through the production endpoint. No more `generate()` in the harness. If the serving path changes, the eval changes with it.\n\nThis costs you reproducibility. Pinning evals to the serving path means a kernel update can move your eval scores even when the weights are frozen. 
That is the right behavior, but it means \"the model regressed\" and \"the runtime changed\" now look the same on the dashboard. You need the fingerprint to tell them apart.\n\nThe fingerprint approach is only as good as what you hash. We hash config, not the actual CUDA kernel binary. A driver update that changes reduction order without changing our config would slip through. The nightly canary is the backstop for that, not the assertion.\n\nTarget-only serving for the exact endpoints roughly halved throughput for those customers. We ate that. Bigger draft models shrink the gap but cost more memory and give back part of the speedup (64 vs. 78 tokens/sec in our runs), so 3B was not a free win either.\n\nAnd 1.2% is our number, on our model, at our logit margins. A model with sharper output distributions will diverge less. One with flatter logits will diverge more. Measure your own.", "url": "https://wpnews.pro/news/speculative-decoding-shifted-our-output-distribution-and-evals-missed-it", "canonical_source": "https://dev.to/marcuswwchen/speculative-decoding-shifted-our-output-distribution-and-evals-missed-it-4dci", "published_at": "2026-06-18 06:31:41+00:00", "updated_at": "2026-06-18 06:51:40.784548+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "mlops", "developer-tools"], "entities": ["Nexus Labs", "vLLM", "Llama-3.1-8B", "Bifrost", "Hugging Face"], "alternates": {"html": "https://wpnews.pro/news/speculative-decoding-shifted-our-output-distribution-and-evals-missed-it", "markdown": "https://wpnews.pro/news/speculative-decoding-shifted-our-output-distribution-and-evals-missed-it.md", "text": "https://wpnews.pro/news/speculative-decoding-shifted-our-output-distribution-and-evals-missed-it.txt", "jsonld": "https://wpnews.pro/news/speculative-decoding-shifted-our-output-distribution-and-evals-missed-it.jsonld"}}