Sampling strategies compared: temperature, top-p, top-k, min-p, and what actually works in production

A developer has mapped the four most common LLM sampling parameters—temperature, top-p, top-k, and min-p—to their concrete effects on output distributions, providing a practical guide for production deployment without relying on general-purpose defaults. The analysis shows that temperature is applied before softmax as a distribution-wide transform that can activate low-probability tokens, while top-p, top-k, and min-p truncate the distribution after softmax, with the order of operations critical since setting temperature to zero renders later truncation parameters ineffective.

You deployed a chatbot, picked temperature 0.7 because every blog post says that, and the first live user sends back screenshots of responses that drift into gibberish mid-sentence. A colleague suggests top-p 0.9. Another says top-k 50. Someone new to the team mentions min-p and claims it solves everything. You have no benchmark, no test set, and no way to tell whether any of these knobs actually fix your specific problem instead of just making the outputs shorter. This is the state of sampling parameter selection for most teams shipping LLM products. The parameters are poorly documented, they interact in non-intuitive ways, and the default values in every inference engine are tuned for general-purpose chat benchmarks, not for your use case. This post maps the four most common sampling knobs -- temperature, top-p, top-k, and min-p -- to the concrete effects they have on the output distribution, so you can pick the right one or combination without guessing. Every LLM generates text one token at a time by choosing from a probability distribution over the vocabulary. The raw distribution the logits from the final transformer layer, passed through softmax is almost never used directly. A raw distribution might assign 0.0001 probability to fifty thousand tokens and 0.3 to the top token. If you sample directly from that, you get a narrow band of high-probability continuations that sound repetitive and robotic. Sampling parameters reshape this distribution. The goal is to widen the distribution enough for creative or useful variation, but not so much that the model assigns meaningful probability to tokens that make no sense. Each parameter attacks a different failure mode: The following diagram shows how each strategy transforms the same logit distribution: php flowchart LR A Raw logits<br/ from model -- B Softmax B -- C Full probability<br/ distribution C -- D{Temperature} D -- |tau < 1| E Sharpened<br/ peaks D -- |tau 1| F Flattened<br/ tails E -- G{Top-p / Top-k / Min-p} F -- G G -- H Truncated<br/ distribution H -- I Sample<br/ next token C -- J Greedy argmax<br/ tau = 0 Each box above is a tunable step. The order matters: temperature is applied to logits before softmax, while top-p, top-k, and min-p are applied to the resulting probability distribution after softmax. If you set temperature to 0 first, the later truncation parameters have no effect because the distribution is already a delta function on the argmax token. Temperature is the oldest and most widely understood parameter. It divides the logits by tau before softmax: P token i = exp logit i / tau / sum j exp logit j / tau When tau = 1, this is the standard softmax. When tau approaches 0, the distribution converges to a one-hot vector on the highest-probability token greedy decoding . When tau is above 1, the distribution flattens, making low-probability tokens more likely than the raw model intended. Practical ranges: tau = 0 deterministic, good for code generation or factual QA , tau = 0.1-0.3 near-deterministic, useful for classification , tau = 0.6-0.9 creative writing, conversational , tau = 1.0-1.5 brainstorming, diverse generations . Above 1.5, the model increasingly produces incoherent text because it is assigning meaningful probability to tokens the model considers unlikely. The critical property of temperature is that it is a distribution-wide transform. It does not prune any tokens; it just makes the probabilities more equal tau 1 or more unequal tau < 1 . This means tau 1 can activate tokens that were essentially zero-probability in the raw distribution, including tokens that are misspellings, in the wrong language, or hallucinated -- because the model gave them low probability for a reason, and temperature is overriding that signal. Top-p, introduced by Holtzman et al. in 2019, solves a specific problem with temperature: temperature alone does not truncate the vocabulary. At tau = 0.8, the model still assigns tiny nonzero probability to thousands of tokens, and sampling from that long tail produces unexpected tokens. Top-p works by sorting tokens by probability descending, then keeping tokens from the top until their cumulative probability exceeds p. If p = 0.9, it keeps the top tokens that collectively account for 90% of the probability mass. This is adaptive: when the model is confident, top-p keeps few tokens; when uncertain, it keeps more. Practical ranges: p = 0.8-0.95 for most generation tasks. Lower values 0.5-0.7 produce more focused outputs useful for factual QA. Values above 0.95 are close to no truncation at all. The surprising property of top-p is that it can be less restrictive than top-k in high-entropy distributions, because it adapts to the distribution shape. Top-k is the simplest truncation: keep only the k tokens with the highest probability and renormalize. A common default is k = 40 or k = 50, inherited from the early GPT-2 days. The problem with top-k is that it is static. When the distribution is peaked model is confident , k = 50 keeps many low-probability tokens that should have been truncated. When the distribution is flat model is uncertain , k = 50 cuts off tokens that carry meaningful probability. Top-k works acceptably when you have tuned k for a specific domain and model, but it is fragile across models and tasks. Practical ranges: k = 10-50 for general generation. k = 1 is greedy effectively tau = 0 . k above 100 approaches no truncation for most models. Min-p, proposed by Nguyen et al. in 2024 arXiv 2407.01082 , addresses the static nature of top-k with an adaptive threshold. It works by setting a floor at min p P max , where P max is the probability of the most likely token. Tokens below this floor are discarded, and the remaining distribution is renormalized. If min p = 0.1 and the top token has probability 0.6, the floor is 0.06. Any token below 0.06 probability is pruned. When the model is confident top token near 1 , the floor is high and few tokens survive. When the model is uncertain top token at 0.3 , the floor drops and more tokens pass through. Practical ranges: min p = 0.01-0.2. Default recommendations from the paper are around 0.05-0.1 for a good balance of creativity and coherence. Values below 0.01 are close to no truncation. Values above 0.2 become very restrictive. | Parameter | What it does | Adaptive? | Common range | Best for | Key failure mode | |---|---|---|---|---|---| Temperature | Scales logits before softmax | No | 0 - 1.5 | Controlling randomness/creativity | Enables low-probability tokens without discrimination | Top-p nucleus | Keeps top tokens up to cumulative probability p | Yes adaptive count | 0.8 - 0.95 | General generation when model confidence varies | Can be too permissive in peaked distributions | Top-k | Keeps only k highest-probability tokens | No fixed count | 10 - 50 | Legacy compatibility, simple tuning | Static; either too restrictive or too permissive | Min-p | Keeps tokens with prob = min p P max | Yes adaptive threshold | 0.01 - 0.2 | Production systems needing coherence + creativity | Less tested at very large scales | In production systems, sampling parameters are almost never used alone. The most common production recipe is: Default for conversational agents: temperature = 0.7, top-p = 0.9, min-p = 0.05. This gives enough randomness for natural variation while the min-p floor prevents the model from wandering into very low-probability regions. Top-k is usually turned off set to 0 or a high value like 200 because min-p and top-p already handle truncation more adaptively. For code generation or structured output: temperature = 0.1-0.2, top-p = 0.95, min-p = 0.01. The near-zero temperature forces most probability onto the top few tokens. Top-p at 0.95 ensures that when the model is truly uncertain e.g., picking a variable name , it still has options beyond the argmax. For creative writing or brainstorming: temperature = 0.9-1.1, top-p = 0.95, min-p = 0.02. Slightly elevated temperature encourages variety. The generous top-p keeps the distribution wide. The low min-p exists mainly as a safety net against the worst long-tail tokens. For classification or extraction: temperature = 0 greedy , no truncation parameters needed. When the output space is a fixed set of labels, any sampling at all reduces accuracy. This is the rare case where the default parameters are actually optimal. Here is a Python snippet showing how vLLM combines these parameters in practice: python from vllm import SamplingParams Conversational agent params = SamplingParams temperature=0.7, top p=0.9, min p=0.05, max tokens=1024, stop= "<|im end| " Code generation code params = SamplingParams temperature=0.1, top p=0.95, min p=0.01, max tokens=2048 Classification deterministic classify params = SamplingParams temperature=0.0, max tokens=16 Stacking truncation parameters without understanding the interaction. Top-p at 0.9 and top-k at 50 at the same time means two truncations fire sequentially. Top-p might keep 30 tokens, then top-k cuts that to 50 -- which does nothing. Or top-k keeps 50, then top-p might further trim them. The effective behavior depends on which truncation applies first. Most engines apply top-k first, then top-p, then min-p. If you set all three, you are relying on an ordering you may not remember next month. Pick at most two truncation methods. Setting temperature above 1.5 and expecting coherence. Temperature is not a creativity dial. Above 1.5, the model assigns significant probability to tokens it considers extremely unlikely. The outputs may appear creative but are actually random. If you need diverse outputs, try increasing top-p or lowering min-p instead of pushing temperature beyond 1.2. Using top-k as the only sampler. This is the most common mistake I see in deployed services. A static k cannot adapt to the distribution. At k=50, sometimes you keep garbage and sometimes you cut off the valid tail. If you must use top-k alone, set k conservatively 10-20 and accept that you are leaving performance on the table. Forgetting that temperature 0 disables all sampling. If temperature is 0, the model always picks the argmax token. Top-p, top-k, and min-p have no effect because there is no distribution to truncate. If you see "temperature=0, top p=0.95" in a config, the top p is dead code. Applying sampling parameters incorrectly in batched inference. Some inference engines share sampling parameters across all sequences in a batch. Passing a per-request temperature override that conflicts with the batch default causes silent fallback to the default. Always verify that per-request sampling overrides are actually wired through the batching layer. Sampling parameters should not be the primary tool for improving output quality if: Your outputs are incoherent at temperature 0. Sampling parameters cannot fix a model that produces bad output even when it is maximally deterministic. If greedy decoding gives poor results, the problem is in the model, the prompt, or the training data, not in the sampling strategy. Add more examples to the prompt or improve the fine-tuning data before touching sampling parameters. You need guaranteed structured output. Sampling introduces nondeterminism. If the application requires valid JSON, a specific schema, or exact string matching, use constrained decoding grammar-guided generation or JSON mode instead of hoping the right parameters keep the output valid. Sampling parameters can reduce the rate of malformed output but cannot eliminate it. You are running a benchmark or eval. Every paper and leaderboard uses greedy decoding temperature 0 or a tightly controlled sampling procedure. If you compare a model at temperature 0.7 against another at temperature 0, you are measuring sampling strategy differences, not model quality differences. For evaluation, use deterministic settings and control for temperature as a variable. You have not measured the output quality. Before tuning sampling parameters, establish a metric -- accuracy on a held-out set, human preference ratings, or a task-specific score. Without a metric, every sampling parameter change is cargo-culting. Measure first, tune second. Your application uses speculative decoding. Speculative decoding's acceptance rate drops significantly at temperature 0 greedy mode compared to low-temperature sampling. If throughput is critical and you use speculation, the optimal temperature may be higher than you would choose for quality alone. Benchmark the throughput-quality tradeoff explicitly. The MCP Model Context Protocol has been called the missing standard for tool integration, but the real question is what it costs in latency, reliability, and debuggability. Next post: a production-oriented walkthrough of MCP -- how tool calls flow through the protocol, where the serialization overhead lives, and what the current ecosystem actually supports.