{"slug": "train-the-draft-model-for-your-workload", "title": "Train the draft model for your workload", "summary": "Nebius launched Custom Speculator Training in Token Factory, enabling teams to train workload-specific draft models from their own data and deploy them alongside base models in one workflow. The feature addresses performance plateaus in speculative decoding for domain-specific workloads like code assistants and enterprise copilots, improving latency predictability and acceptance rates.", "body_md": "Train the draft model for your workload\n\nCustom Speculator Training is now in Token Factory. Move from production data to a workload-specific draft model, then deploy it alongside the base model, in one workflow.\n\nSpeculative decoding is usually introduced as a throughput trick: run a small, fast draft model ahead of the large one, verify its guesses in parallel, and watch tokens per second climb. That framing undersells what it actually does in production. In real systems, speculative decoding reshapes the execution path that decides whether p90 and p99 latency stay predictable under load.\n\nWhich is why the draft model matters more than most teams realize.\n\nA generic draft model is tuned for average traffic. It doesn’t know that your users write code in Go, or that your agent repeatedly asks for structured JSON, or that your copilot spends most of its day completing the same four or five prompt shapes. On recurring, domain-specific workloads, a generic drafter plateaus, and the gap between benchmark performance and production behavior becomes the thing your SLOs absorb.\n\nToday we are launching Custom Speculator Training in Nebius Token Factory. 
Teams can train workload-specific draft models from their own data, then deploy them inside the same platform that serves the base model.\n\nWhy generic speculators plateau\n\nSpeculative decoding works by having the draft model propose tokens that the target model verifies. The more often proposals are accepted, the larger the throughput and latency win. Acceptance rate is workload-dependent: a drafter trained on general instruction data can only approximate the distribution your product actually generates.\n\nFor narrow, high-volume workloads (code assistants, enterprise copilots, structured-output agents, chat with repeated prompt shapes) that approximation leaves real performance on the table. A drafter trained on the traffic you serve narrows that gap.\n\nThe goal is not a better benchmark number. The goal is a more predictable execution path.\n\nWhat you can do with Custom Speculator Training\n\nInside Token Factory, the workflow is one loop.\n\n- **Bring your training data.** Use Data Lab to attach S3-compatible files in place, query production logs, and merge them with synthetic seed datasets. Or upload a curated JSONL file directly through the Post-training console or the Files API. For the launch set of models, we are providing free synthetic datasets for 6 leading open-source frontier models, so teams can start training even when their own logs alone are not enough.\n\n- **Train the drafter.** Pick the target base model, point the training job at your prepared dataset, and run the job from the Post-training page. Preset hyperparameters cover the common case. For advanced users, full hyperparameter control is exposed (decoding heads, loss type, learning rate, scheduler, drafter type) without making that complexity the default. 
Training uses the same Post-training pipeline that already runs supervised fine-tuning on Papyrax, our distributed training framework.\n\n- **Deploy alongside the base model.** The trained drafter lands in Custom Weights. From there, it attaches to a Dedicated Endpoint with the base model and serves real traffic. At launch, deployment is reviewed by our solutions engineers, so teams get a human-in-the-loop check before a custom drafter sees production load.\n\n- **Measure on the workload that matters.** Acceptance rate, tokens per second, and end-to-end latency are all visible alongside the rest of your serving metrics. If the workload changes materially, retrain.\n\nThe June 25, 2026 launch covers the API and UI, Eagle 3 and our Nebius-optimized Eagle 3 architectures, and the currently supported model set configured in Papyrax. Jobs start from the Post-training page. Data preparation rides on Data Lab, which has been live since May.\n\nBroader model support is expected over time. At launch, model support is limited to the approved launch set; it does not cover arbitrary models on day one.\n\nBuilt on a platform already running\n\nCustom Speculator Training isn’t a standalone feature. It plugs into rails that have been settling into production for months. **Post-training** has been live since December 2025. Data Lab shipped in mid-May with in-place S3 import and the dataset merge workflow. RL Platform v1 entered private preview a few days later. 
Custom Speculator is the moment the loop closes: customers now shape both model behavior and serving behavior on their own data, in one platform.\n\nBuilt on research we ship in the open\n\nOur research team has published **LK Losses**, a novel loss function that directly optimizes draft-token acceptance rate rather than proxying it through KL divergence, along with the **LK-Speculators** model family on Hugging Face and a public contribution to **SpecForge**. Across models from 8B to 685B parameters, LK Losses show consistent improvements in acceptance rate over KL-trained baselines.\n\nThe research is starting to shape the broader open ecosystem too. The vLLM project recently picked up an upstream change inspired by the LK Losses analysis of probabilistic rejection sampling, fixing how draft proposal probabilities flow through the verification path. The work that powers Custom Speculator Training is the work the open community is now building on.\n\nThat research is what Custom Speculator Training productizes. The open artifacts stay open. The training pipeline, the data tooling, and the deployment path are what Token Factory adds on top.\n\nWhat this changes in production\n\n**Better throughput on recurring workloads.** Speculative decoding performs best when the drafter and target agree often. A drafter trained on the distribution your product actually serves raises that agreement rate where it matters.\n\n**More stable latency under load.** Acceptance rate is about predictability as much as speed. A workload-matched drafter tightens the gap between median and tail latency on the traffic your system sees every day.\n\n**Better unit economics as volume scales.** Faster serving reduces GPU-seconds per request. 
That matters most where it always matters most: at the point where a product graduates from successful pilot to expensive steady state.\n\n**Serving behavior you control.** Public APIs do not let you train the fast path. When your drafter is trained on your data, the part of the stack that most affects how your product feels is no longer commodity.\n\n**Control surfaces that match the team.** For most teams, preset hyperparameters are the right starting point. For the teams who want to tune (decoding heads, loss type, learning rate, scheduler, drafter architecture), the control is there without being in the way.\n\nWho this is for\n\nThis launch is built for teams that already have production traffic and care about how it performs:\n\n- ML and platform engineers running open-source models behind real products: copilots, agents, coding tools, enterprise chat, reasoning systems.\n\n- AI product teams whose workload is concentrated enough that a workload-specific drafter can plausibly outperform a generic one.\n\n- Enterprise accounts already using generic speculative decoding on Token Factory who have hit the ceiling of what a general drafter can do on their traffic.\n\nIf you are still shipping a first prototype on highly variable traffic, a generic drafter is the right starting point. Come back when you have repeatable workload patterns and a performance target worth chasing.\n\nHow to get started\n\nCustom Speculator Training goes live on June 25. If you are a current Token Factory customer with a production workload and a concrete performance goal, talk to your solutions engineer. 
We will help you scope the dataset, run the first training job, and review the deployment before it starts serving.\n\nIf you are new to Token Factory, start with a Dedicated Endpoint on the base model, gather traffic, then come back to train the drafter once you have enough signal.\n\nThe real question in production is no longer whether speculative decoding exists. It is whether the draft model is shaped for the traffic you actually serve. Custom Speculator Training is how you do that [inside Token Factory](https://tokenfactory.nebius.com/post-training/new-job).", "url": "https://wpnews.pro/news/train-the-draft-model-for-your-workload", "canonical_source": "https://nebius.com/blog/posts/train-the-draft-model-for-your-workload", "published_at": "2026-06-26 10:39:45+00:00", "updated_at": "2026-06-26 11:05:46.062962+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "ai-products"], "entities": ["Nebius", "Token Factory", "Data Lab", "Papyrax", "Eagle 3"], "alternates": {"html": "https://wpnews.pro/news/train-the-draft-model-for-your-workload", "markdown": "https://wpnews.pro/news/train-the-draft-model-for-your-workload.md", "text": "https://wpnews.pro/news/train-the-draft-model-for-your-workload.txt", "jsonld": "https://wpnews.pro/news/train-the-draft-model-for-your-workload.jsonld"}}