[ARTICLE · art-32207] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=↓ negative

Speculative decoding shifted our output distribution and evals missed it

Nexus Labs enabled speculative decoding in vLLM for a fine-tuned 8B model, achieving a 1.9x throughput gain, but discovered that greedy decoding with a draft model is not bit-identical to greedy decoding without one due to float16 arithmetic differences. This caused a 1.2% output divergence in tool-call arguments, which offline evals missed because they tested a different serving path. The team fixed the issue by routing all eval traffic through the same serving endpoint and adding CI assertions to detect decode path drift.

5 min read · published Jun 18, 2026

TL;DR: We turned on speculative decoding in vLLM to cut latency on a fine-tuned 8B. Got a 1.9x throughput win. Three weeks later a customer flagged that the agent's tool-call arguments had subtly changed. Greedy decoding with a draft model is not bit-identical to greedy decoding without one, and our offline evals never caught the drift because they ran on a different serving path.

I lead the eval team at Nexus Labs. We do enterprise agent automation, Series B, about 14 people in engineering. The model we fine-tune is a Llama-3.1-8B variant that drives tool calls. Latency matters because each agent turn can chain 4 or 5 calls.

So we enabled speculative decoding. Draft model was a distilled 1B. Target was our 8B. The pitch is simple: the draft proposes tokens, the target verifies them in one forward pass, you accept the longest matching prefix. When acceptance is high you get tokens nearly for free.
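The propose-and-verify loop described above can be sketched in a few lines. This is a toy illustration, not vLLM's implementation: `draft_next`, `target_next`, and the two toy "models" are hypothetical stand-ins for greedy next-token functions, and the target is called position by position here where a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical stand-in: maps a token context to the greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Return the tokens that one greedy speculative step appends to prefix."""
    # 1. The draft proposes k tokens autoregressively.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each drafted position. A real engine scores all
    #    positions in one batched pass; this loop is the sequential
    #    equivalent in exact arithmetic.
    ctx = list(prefix)
    out = []
    for tok in proposed:
        expected = target_next(ctx)
        if expected != tok:
            out.append(expected)  # first mismatch: keep the target's token
            return out
        out.append(tok)
        ctx.append(tok)

    # 3. Every drafted token matched, so the target's next token is a bonus.
    out.append(target_next(ctx))
    return out


def generate_speculative(prefix, draft_next, target_next, n, k=5):
    """Generate n tokens with repeated speculative steps."""
    tokens = list(prefix)
    while len(tokens) - len(prefix) < n:
        tokens.extend(speculative_step(tokens, draft_next, target_next, k))
    return tokens[len(prefix):len(prefix) + n]


def generate_greedy(prefix, target_next, n):
    """Generate n tokens with the target alone, one token at a time."""
    tokens = list(prefix)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens[len(prefix):]


# Toy deterministic "models" (made up): the draft is wrong whenever the
# context length is a multiple of four.
def toy_target(ctx):
    return (7 * len(ctx) + sum(ctx)) % 10


def toy_draft(ctx):
    if len(ctx) % 4 == 0:
        return (toy_target(ctx) + 1) % 10
    return toy_target(ctx)
```

With exact integer "logits" like these, `generate_speculative([1, 2], toy_draft, toy_target, 20)` and `generate_greedy([1, 2], toy_target, 20)` return the same tokens, which is the lossless guarantee. The rest of this post is about what happens when the two target computations are not numerically identical.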

The throughput number was real. 1.9x at our batch sizes. The problem was everything we assumed about correctness.

The vLLM docs say speculative decoding is lossless for greedy. That is true in exact arithmetic. It is not true in float16 on a GPU.

Here is the thing nobody tells you. The verification step recomputes logits for the drafted tokens in a batched forward pass. The target model alone computes them token-by-token. Different batch shapes, different kernel paths, different reduction order. The argmax usually agrees. Usually.

When the top two logits are within a few thousandths of each other, the batched path and the sequential path can pick different tokens. For most text that is invisible. For structured tool-call output where one token flips "limit": 50 to "limit": 500, it is not invisible at all.
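To make the reduction-order point concrete, here is a deliberately exaggerated toy using only the standard library. The partial sums and the competing logit are made-up numbers, not values from our model, and real kernels usually accumulate in higher precision than this. The sketch only shows that float16 addition is not associative, so the same terms summed in a different order can flip an argmax.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_f16(terms):
    """Accumulate left to right, rounding to float16 after every add."""
    acc = 0.0
    for t in terms:
        acc = to_f16(acc + to_f16(t))
    return acc


def argmax(values):
    """Index of the largest value (first index wins ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up partial sums that add up to the logit of token 0.
terms_a = [1024.0, 0.5, 0.5, 0.5, 0.5]
# Made-up logit of the competing token 1, held fixed.
logit_b = to_f16(1025.0)

# Large term first: every 0.5 is rounded away against 1024.
large_first = sum_f16(terms_a)
# Small terms first: the 0.5s combine before meeting 1024.
small_first = sum_f16(list(reversed(terms_a)))

pick_large_first = argmax([large_first, logit_b])
pick_small_first = argmax([small_first, logit_b])
```

Here `large_first` is 1024.0 and `small_first` is 1026.0, so the first ordering picks token 1 and the second picks token 0. Same inputs, same math on paper, different greedy token.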

We measured it. Ran the same 2,000 prompts through both paths, greedy, temperature 0.

| Serving path | Exact-match outputs | Tool-arg mismatch | Tokens/sec |
| --- | --- | --- | --- |
| Target only (no spec) | baseline | 0% | 41 |
| Spec decode, 1B draft | 98.8% | 1.2% | 78 |
| Spec decode, 3B draft | 99.4% | 0.6% | 64 |

1.2% of outputs differed. On agent traffic that chains calls, a 1.2% per-call divergence compounds. Over a 5-call session that's roughly a 6% chance at least one call drifts.
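The compounding figure is simple arithmetic. A minimal sketch, assuming each call diverges independently with the same probability (an assumption made for the estimate, not something we measured):

```python
def p_any_divergence(per_call_rate: float, calls: int) -> float:
    """Probability that at least one of `calls` independent calls diverges."""
    return 1.0 - (1.0 - per_call_rate) ** calls


# Rates from the table above, over a 5-call session.
session_1b = p_any_divergence(0.012, 5)  # 1B draft
session_3b = p_any_divergence(0.006, 5)  # 3B draft
```

For the 1B draft this comes out just under 6%, and for the 3B draft just under 3%.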

This is the part I'm actually annoyed about. Our offline eval suite hit the model directly through the HF generate() API. No speculative decoding. No batched verification. Our production serving stack ran vLLM with spec decode on.

We were evaluating one numerical path and shipping another. The eval harness was honest about the model it tested. It just wasn't testing the model we served.

The fix was boring and correct: evaluate against the exact serving endpoint. We route all eval traffic through the same gateway the app uses, so the eval client and the production client are indistinguishable to the backend. We use Bifrost in front of our vLLM and external providers, which gave us one OpenAI-compatible endpoint to point both at. The point isn't the tool. The point is your eval requests must traverse the identical decode path, kernels included.

Here's the config flag that matters in vLLM:

model: /models/nexus-8b-toolcall
speculative_config:
  model: /models/nexus-1b-draft
  num_speculative_tokens: 5
speculative_disable_logprobs: false

And the eval-side assertion we added so this never ships silently again:

resp = client.chat.completions.create(
    model="nexus-8b-toolcall",
    messages=msgs,
    temperature=0,
    extra_body={"spec_decode": True},  # must match prod
)
assert resp.system_fingerprint == EXPECTED_FINGERPRINT, (
    f"decode path drift: {resp.system_fingerprint}"
)

We compute a fingerprint from the serving config (draft model hash, num_speculative_tokens, kernel version) and assert it. If someone bumps vLLM or swaps the draft, CI goes red before the eval numbers are trusted.
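One way to derive such a fingerprint is to hash a canonical serialization of the serving config. This is a minimal sketch, not our exact code: the field names (`draft_model_sha256`, `kernel_version`) and the placeholder values are hypothetical, and how the value ends up in `system_fingerprint` depends on your serving stack.

```python
import hashlib
import json


def serving_fingerprint(config: dict) -> str:
    """Hash a serving config into a short, stable identifier."""
    # Sorted keys and fixed separators make the JSON canonical, so the
    # same settings always hash the same regardless of dict order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical config; the values are placeholders, not real hashes.
prod_config = {
    "draft_model_sha256": "placeholder-draft-weights-hash",
    "num_speculative_tokens": 5,
    "kernel_version": "placeholder-runtime-version",
}

# Same settings after someone bumps the draft length.
changed_config = dict(prod_config, num_speculative_tokens=4)

prod_fp = serving_fingerprint(prod_config)
changed_fp = serving_fingerprint(changed_config)
```

Any change to a hashed field yields a different fingerprint, which is what turns CI red. Anything you leave out of the dict is invisible to the check.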

We kept speculative decoding. The latency win was worth more than 1.2% drift for most of our endpoints. But we did three things.

First, we raised the bar on tool-call endpoints specifically. For the two customers running financial workflows, we run target-only, no draft. Slower, exact. They opted in to the cost.

Second, we started running a nightly divergence canary that replays 500 prompts through both serving paths and alerts if mismatch exceeds 1.5%. This caught a vLLM upgrade that shifted draft acceptance logic and pushed mismatch to 2.1%.
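The core of the canary is small. A sketch under stated assumptions: `path_a` and `path_b` are hypothetical callables that send a prompt to each serving path and return the output text, and the fake paths at the bottom exist only to exercise the logic. The 1.5% threshold is the one quoted above.

```python
from typing import Callable, Iterable

MISMATCH_THRESHOLD = 0.015  # alert above 1.5%


def mismatch_rate(prompts: Iterable[str],
                  path_a: Callable[[str], str],
                  path_b: Callable[[str], str]) -> float:
    """Fraction of prompts whose outputs differ between two serving paths."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("canary needs at least one prompt")
    mismatches = sum(1 for p in prompts if path_a(p) != path_b(p))
    return mismatches / len(prompts)


def should_alert(rate: float, threshold: float = MISMATCH_THRESHOLD) -> bool:
    """True when the measured mismatch rate exceeds the threshold."""
    return rate > threshold


# Fake paths for illustration: path B differs on every 50th prompt.
def fake_target_only(prompt: str) -> str:
    return f'{{"limit": 50, "q": "{prompt}"}}'


def fake_spec_decode(prompt: str) -> str:
    index = int(prompt.split("-")[1])
    limit = 500 if index % 50 == 0 else 50
    return f'{{"limit": {limit}, "q": "{prompt}"}}'


canary_prompts = [f"prompt-{i}" for i in range(500)]
rate = mismatch_rate(canary_prompts, fake_target_only, fake_spec_decode)
alert = should_alert(rate)
```

With these fake paths, 10 of 500 outputs differ, so the rate is 2% and the canary alerts. In practice the comparison could be stricter or looser than string equality, for example comparing only parsed tool-call arguments.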

Third, all eval traffic now routes through the production endpoint. No more generate() in the harness. If the serving path changes, the eval changes with it.

This costs you reproducibility. Pinning evals to the serving path means a kernel update can move your eval scores even when the weights are frozen. That is correct, but it means "the model regressed" and "the runtime changed" now look the same on the dashboard. You need the fingerprint to tell them apart.

The fingerprint approach is only as good as what you hash. We hash config, not the actual CUDA kernel binary. A driver update that changes reduction order without changing our config would slip through. The nightly canary is the backstop for that, not the assertion.

Target-only serving for the exact endpoints roughly halved throughput for those customers. We ate that. Bigger draft models shrink the gap but cost more memory and give back part of the speedup (64 vs. 78 tokens/sec in our runs), so 3B was not a free win either.

And 1.2% is our number, on our model, at our logit margins. A model with sharper output distributions will diverge less. One with flatter logits will diverge more. Measure your own.
