{"slug": "sampling-args-in-llama-server", "title": "Sampling args in llama-server", "summary": "Llama.cpp users can significantly improve inference speed and output quality by tuning sampling parameters such as temperature, TopP, MinP, TopK, repeat penalty, DRY, XTC, Dynatemp, Adaptive-P, and Mirostat, which mitigate common failure modes like probability collapse, hallucination, grammar degradation, and quantization noise. The article provides a reference for these parameters, their ranges, defaults, and how to adjust them per request using the OpenAI-compatible API.", "body_md": "# Sampling args in llama-server\n\n### Reducing repetition, hallucinations, degradation, while making inference faster!\n\nllama.cpp is the most popular LLM runtime for open weight LLMs. Most beginners (including myself) used LM Studio, Jan, and Ollama but when you get a grasp on the basics, you may have much more control over the model runtime by using llama.cpp directly.\n\nThe difference is night and day. Same model may go from 10 tok/sec to 20 tok/sec when you tweak sampling. However, it’s not just about speed! These parameters impact benchmark results and eval effectiveness, yet they’re mostly underutilized.\n\nThis article is a reference for:\n\nCommon failure modes for local (especially quantized) models\n\nllama.cpp sampling parameters, what they do, their valid range, default value, and how to adjust them based on your workload (e.g. creative writing, LLM as a judge, deterministic code generation, etc.). 
We’ll discuss:\n\nCommon params: Temperature, TopP, MinP, TopK, repeat penalty\n\nDRY\n\nXTC\n\nDynatemp\n\nAdaptive-P\n\nMirostat\n\nWhy the older, more common switches and knobs (temperature, TopK, TopP) are not adequate, and what the modern alternatives are\n\nSome tips and tricks to accelerate your experimentation loop\n\n**Note: I used Gemini 3.1 Pro Extended Thinking model in the preliminary research stage but I’ve gone through everything and heavily edited it to bear my personal name on it. All errors are mine.**\n\n# Failure modes\n\nBy explicitly setting sampling and repetition switches, we can mitigate several common failure modes:\n\n**Probability Collapse (The Infinite Loop):** The model becomes overly confident in a specific sequence (e.g., Markdown table formatting, empty JSON brackets) and gets stuck in an unrecoverable repetition loop.\n\n**Hallucination and Syntax Breakage:** Excessive unconstrained randomness (high entropy) causes the model to generate factually incorrect statements, break structured formats, or output grammatical gibberish.\n\n**Grammar Degradation:** Older, blunt token penalties blindly punish essential structural words (“the”, “a”, “{”, punctuation, etc.) simply because they appear frequently, destroying sentence coherence over long context windows.\n\n**Quantization Noise (Perplexity Spikes):** Local quantization introduces statistical artifacts into the logit distribution that static samplers struggle to handle smoothly, leading to unpredictable drops in generation quality.\n\n*Note: perplexity is a statistical metric that measures how “confused” or “surprised” a model is by the actual next word in a sentence. A low perplexity score means the model assigned a higher probability to the correct words, indicating a better understanding of the language and context.*\n\n## Startup vs. 
Runtime Configuration\n\nThe configurations for llama.cpp can be grouped in 2 categories:\n\n**Immutable:** set when you start the app and cannot be changed per-request. For example: `--ctx-size`.\n\n**Mutable per request:** can be set at start time but the request payload (compliant with the OpenAI API) can change them. For example, a given request that comes with the `temperature` value in its payload can override what you specify using the `--temperature` CLI argument when starting llama-server.\n\nFortunately, most sampling parameters can be set per-request, so you can experiment and iterate quickly using something like [VS Code REST Client](https://marketplace.visualstudio.com/items?itemName=humao.rest-client).\n\nHere’s an example config you can modify:\n\n```\n@host = http://localhost:8080\n@model = qwen-3.6-35B-A3B-MTP-UD\n\n### Get the properties and their values\n# To make POST request to change global properties, you need to start server with --props\nGET {{host}}/props\nContent-Type: application/json\n\n---\n\n### Send a simple request\nPOST {{host}}/v1/chat/completions\nContent-Type: application/json\n\n{\n    // Temperature controls the randomness of the output. 
Lower values make the output more deterministic.\n    \"temperature\": 0.1,\n    // Setting TopP\n    \"top_p\": 0.75,\n    // Maximum output tokens\n    \"max_completion_tokens\": 1024,\n    // penalize new tokens based on whether they appear in the text so far\n    \"presence_penalty\": 2,\n    // penalize new tokens based on their existing frequency in the text so far.\n    \"frequency_penalty\": 2,\n    // Exclude Top Choices (XTC)\n    \"xtc_probability\": 0.5,\n    \"xtc_threshold\": 0.1,\n    \"model\": \"{{model}}\",\n    \"messages\": [\n        {\n            \"role\": \"system\",\n            \"content\": \"You are a masterful toddler short-form storyteller.\"\n        },\n        {\n            \"role\": \"user\",\n            \"content\": \"Tell me a short story about a duck that couldn't fly.\"\n        }\n    ],\n    \"stream\": false,\n    \"return_progress\": true,\n    \"reasoning_format\": \"auto\",\n    \"chat_template_kwargs\": {\n        \"enable_thinking\": false\n    },\n    \"reasoning_control\": true,\n    \"backend_sampling\": false,\n    \"timings_per_token\": true\n}\n\n### Send a simple request\nPOST {{host}}/v1/chat/completions\nContent-Type: application/json\n\n{\n    // Temperature controls the randomness of the output. 
Lower values make the output more deterministic.\n    \"temperature\": 0.1,\n    // Setting TopP\n    \"top_p\": 0.75,\n    // Maximum output tokens\n    \"max_completion_tokens\": 1024,\n    // penalize new tokens based on whether they appear in the text so far\n    \"presence_penalty\": 2,\n    // penalize new tokens based on their existing frequency in the text so far.\n    \"frequency_penalty\": 2,\n    // Exclude Top Choices (XTC)\n    \"xtc_probability\": 0.5,\n    \"xtc_threshold\": 0.1,\n    \"model\": \"{{model}}\",\n    \"messages\": [\n        {\n            \"role\": \"system\",\n            \"content\": \"Your task is to finish the user's sentence with exactly one word.\"\n        },\n        {\n            \"role\": \"user\",\n            \"content\": \"United States of\"\n        }\n    ],\n    \"stream\": false,\n    \"return_progress\": true,\n    \"reasoning_format\": \"auto\",\n    \"chat_template_kwargs\": {\n        \"enable_thinking\": false\n    },\n    \"reasoning_control\": true,\n    \"backend_sampling\": false,\n    \"timings_per_token\": true\n}\n```\n\nSince there’s no harness or system prompt, you can quickly iterate through different parameter values.\n\nAnother tip is to use Gemini 3.1 Pro extended thinking to understand and set different values. Just make sure to give it ample information about your hardware and runtime environment to get good help. Always check the response against the official documentation.\n\nAnother tip is to write your command in a shell script and have it open in an editor between tweak-run cycles. 
Here’s an example:\n\n``` bash\n#!/usr/bin/env bash\n\nllama-server \\\n    --hf-repo unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \\\n    --alias qwen-3.6-35B-A3B-MTP-UD \\\n    --threads 8 \\\n    --threads-batch 8 \\\n    --parallel 2 \\\n    --kv-unified \\\n    --batch-size 2048 \\\n    --ubatch-size 512 \\\n    --ctx-size 131072 \\\n    --n-predict 8192 \\\n    --reasoning-budget 1024 \\\n    --cache-ram 8192 \\\n    --n-gpu-layers all \\\n    --jinja \\\n    --cont-batching \\\n    --flash-attn on \\\n    --temp 1.0 \\\n    --top-p 0.95 \\\n    --top-k 20 \\\n    --min-p 0.00 \\\n    --samplers \"top_k;top_p;min_p;temperature;typ_p\" \\\n    --image-min-tokens 1024 \\\n    --presence-penalty 1.5 \\\n    --spec-type draft-mtp \\\n    --spec-draft-n-max 2 \\\n    --mmap \\\n    --metrics \\\n    --log-colors on \\\n    --log-verbosity 3 \\\n    --log-prompts-dir ./prompt-logs \\\n    --log-file llama-cpp.log \\\n    --host 0.0.0.0 \\\n    --port 8080\n\n# --parallel should be at least 2 to prevent /metrics requests from being cancelled.\n# W srv    load_model: cache_reuse is not supported by this context, it will be disabled\n#    --cache-reuse 256 \\\n# Disabled for speed\n#    --cache-type-k q8_0 \\\n#    --cache-type-v q4_0 \\\n```\n\n## The Execution Pipeline\n\nBefore tuning individual parameters, it is critical to understand how the sampling execution graph is constructed.\n\n`--samplers SAMPLERS_LIST`\n\n**Mechanic**: Defines the exact sequence of algorithms (semicolon separated) that applies to the raw logits.** Default**:`penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature`\n\nUnless you are explicitly tuning for a specific mathematical outcome or optimizing CPU overhead, leave this parameter blank to use the default execution order.\n\n**Critical Nuances:**\n\n**Activation via Inclusion:** Setting a command-line argument (e.g.,`--top-p 0.5`\n\n) merely configures an internal state variable. 
If `top_p` is *not* included in the `--samplers` list (by default it is), then it doesn’t have any effect.\n\n**Mathematical Precedence:** The order matters. For example, the default pipeline applies `penalties` and `dry` *before* `top_p`. This ensures raw scores are penalized first, allowing truncation samplers to correctly drop heavily penalized tokens. Reversing this order could result in truncating the pool down to 10 tokens, penalizing 8 of them, and forcing the model to choose from 2 terrible remaining options.\n\n**Compute Optimization:** Sampler order impacts CPU overhead. By placing a rigid truncation sampler like `top_k` early in the sequence, you drop thousands of long-tail logits from memory. Subsequent, computationally expensive samplers (like XTC or DRY) will then execute much faster because they only iterate over a small array of tokens (e.g., 40) instead of the model’s entire 128,000+ vocabulary.\n\n*Note: A token is the actual building block of text (a word or sub-word piece) the AI uses, whereas a logit is a raw, unnormalized numerical score that the model assigns to a given token in its vocabulary to determine which one comes next.*\n\n## 2. Basic Probability Shaping\n\nThese arguments modify the raw probability distribution of the next token *before* it is sampled. They define the vocabulary pool the model is allowed to draw from.\n\n### 2.1 Temperature\n\n**CLI parameter:** `--temperature N`\n\n**Request parameter**: `temperature` ([ref](https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create#(resource)%20chat.completions%20%3E%20(method)%20create%20%3E%20(params)%200.non_streaming%20%3E%20(param)%20temperature%20%3E%20(schema)))\n\n**Range**: `0.0` (greedy) to `2.0+` (creative). 
Note: although it is technically possible to go above 2.0, it hurts quality and usually leads to nonsensical output.\n\n**Default:** `0.80` (llama-server’s default)\n\nDivides the raw logits by `N` before applying the softmax* function.\n\nN < 1.0: sharpens the distribution (more deterministic; `0.0` is greedy).\n\nN > 1.0: flattens the distribution (increases variance).\n\n*Note: * The softmax function converts raw, unnormalized prediction scores (logits) into a proper probability distribution. It guarantees all token probabilities fall between 0 and 1 and sum up to exactly 1.*\n\n**Setting temperature**:\n\nRAG/Coding/Fact-checking: `0.0` to `0.3` (Prioritize strict syntax and grounded facts).\n\nReview/Judge: `0.4` to `0.7` (Needs coherence but flexibility in reasoning).\n\nCreative Writing: `0.8` to `1.2+` (Requires strict bounds like Min-P to prevent gibberish at higher values).\n\n### 2.2 Top-P\n\n**CLI parameter:** `--top-p N`\n\n**Request parameter**: `top_p` ([ref](https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create#(resource)%20chat.completions%20%3E%20(method)%20create%20%3E%20(params)%200.non_streaming%20%3E%20(param)%20top_p%20%3E%20(schema)))\n\n**Range**: `0.0` to `1.0` (disabled).\n\n**Default:** `0.95`\n\nTop-P (also known as Nucleus Sampling) sorts tokens by probability, then retains the smallest set of top tokens whose cumulative probability reaches `N`. This creates dynamic truncation. If the model is highly confident, the token pool is small. 
If uncertain, the pool is wide.\n\nA Top-P of 0 is essentially greedy sampling, which selects the highest-probability token.\n\nA Top-P of 1 disables truncation, so the model samples from the full distribution.\n\n**Setting Top-P**:\n\nCreative writing: `0.80` to `0.95`\n\nCoding/RAG: `0.1` to `0.5`\n\n### 2.3 Min-P\n\n**CLI parameter:** `--min-p N`\n\n**Request parameter**: `min_p` (I could not find a reference to it in the OpenAI API but [OpenRouter has it](https://openrouter.ai/docs/api/reference/parameters#min-p)).\n\n**Values**: `0.0` (disabled) to `1.0`\n\n**Default:** `0.05`\n\nTruncates any token whose probability is less than N times the probability of the *most likely* token.\n\nFor example, if the most likely token has a probability of `0.93`, a Min-P of `0.05` removes any token which has a probability less than `0.05 x 0.93 = 0.0465`.\n\nMin-P is highly effective for smaller or heavily quantized models. It dynamically scales the truncation threshold based on the model’s confidence, preventing “garbage” tokens without the hard cumulative limit of Top-P.\n\n**Setting Min-P**:\n\n`0.05` to `0.1` across almost all use cases (the default value of `0.05` is pretty good in my experience).\n\nIt allows for a high temperature (`1.5+`) in creative writing while keeping the output grammatically coherent.\n\n### 2.4 Top-K\n\n**CLI parameter:** `--top-k N`\n\n**Request parameter**: `top_k`\n\n**Values**: `0` (disabled) all the way to the size of the vocabulary!\n\n**Default:** `40`\n\nTop-K sorts tokens by probability and discards all but the top `N` tokens. This is more rigid than Top-P, which dynamically chooses the top possibilities to reach a specific sum.\n\n⚠️ Top-K is largely considered legacy. Use Top-P and Min-P instead. If enabled, keep it relatively high (`40 - 100`) to avoid artificially constraining the model into loops.\n\n## 3. 
Traditional Token-Level Penalties\n\nThese parameters apply mathematical reductions to a token’s logit based on its prior appearance in the context window. They are blunt instruments that can degrade grammar if overused.\n\n### 3.1 Penalty window\n\n**CLI parameter:** `--repeat-last-n N`\n\n**Request parameter**: `repeat_last_n` (not to be confused with `n_keep`, which controls how many prompt tokens are retained when the context overflows)\n\n**Values**: `0` (disabled), `-1` (entire context), or positive integer.\n\n**Default:** `64`\n\nThis parameter defines the look-back window (in tokens) for all token-level penalties.\n\n**Usage**:\n\nShort structured outputs: use `-1` to look at the entire context.\n\nLong-form creative writing: use `256` to `1024` so the model can eventually reuse vocabulary.\n\n### 3.2 Presence penalty\n\n**CLI parameter:** `--presence-penalty N`\n\n**Request parameter**: `presence_penalty` ([ref](https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create#(resource)%20chat.completions%20%3E%20(method)%20create%20%3E%20(params)%200.non_streaming%20%3E%20(param)%20presence_penalty%20%3E%20(schema)))\n\n**Values**: `-2.0` to `2.0`\n\n**Default:** `0.00` (disabled).\n\nThis parameter encourages the introduction of new topics/vocabulary without punishing a word heavily for multiple uses. It subtracts a flat value (`N`) from a token’s logit if the token has appeared at least once.\n\nN > 0 penalizes new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics.\n\n⚠️ `N < 0` boosts repetition, which may lead to loops or to padding the end of the output with repetitive characters. 
I cannot think of a valid example but if you do, please let me know in the comments.\n\n**Usage**: `0.1` to `0.4` for brainstorming, creative writing, or editorial work.\n\n### 3.3 Frequency penalty\n\n**CLI Parameter:** `--frequency-penalty N`\n\n**Request parameter**: `frequency_penalty` ([ref](https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create#(resource)%20chat.completions%20%3E%20(method)%20create%20%3E%20(params)%200.non_streaming%20%3E%20(param)%20frequency_penalty%20%3E%20(schema)))\n\n**Values**: `-2.0` to `2.0`\n\n**Default:** `0.00` (disabled)\n\nPunishes repetitive verbal tics. The more a token is used, the harder it is penalized.\n\n**N > 0** penalizes new tokens based on their existing frequency in the text so far. Subtracts (`N x token_count`) from the logit.\n\n⚠️ `N < 0` boosts repetition, which may lead to loops. Again, please let me know in the comments if you can think of a use case for negative values.\n\n**Usage**: `0.1` to `0.3` to gently suppress the overuse of specific adjectives or transition words.\n\n### 3.4 Repeat Penalty\n\n**CLI parameter:** `--repeat-penalty N`\n\n**Request parameter**: `repeat_penalty`\n\n**Values**: `1.0` (disabled) to `1.2+`\n\n**Default:** `1.00`\n\nIt divides the positive logits of previously generated tokens by N (negative logits are multiplied by N instead), so the score always moves down.\n\n⛔ It’s generally recommended to **disable** it in favor of DRY or Presence/Frequency penalties, because repeat penalty aggressively suppresses structural words (“the”, “a”, punctuation) and easily breaks syntax.\n\n## 4. DRY\n\n“Don’t Repeat Yourself” (DRY) sampling is a more modern sequence control. It evaluates *sequences* of tokens rather than isolated tokens. 
This prevents catastrophic loops (like repeating Markdown tables) without degrading single-word grammar.\n\n### 4.1 DRY Multiplier\n\n**CLI parameter:** `--dry-multiplier N`\n\n**Request parameter**: `dry_multiplier`\n\n**Values**: `0.0` (disabled) to `1.0`\n\n**Default:** `0.0`\n\nMaster weight for the DRY sampling algorithm.\n\n### 4.2 DRY Allowed Length\n\n**CLI parameter:** `--dry-allowed-length N`\n\n**Request parameter**: `dry_allowed_length`\n\n**Values**: Positive integer\n\n**Default:** `2`\n\nThe sequence length threshold before penalties apply. This allows a certain degree of repetition.\n\n### 4.3 DRY Base\n\n**CLI parameter:** `--dry-base N`\n\n**Request parameter**: `dry_base`\n\n**Values**: Float `> 1.0`\n\n**Default:** `1.75`\n\nExponential scaling factor once a sequence exceeds the allowed length. The penalty grows as `multiplier * base^(repeated_length - allowed_length)`, so each additional repeated token makes continuing the repetition much more expensive.\n\n### 4.4 DRY Sequence Breaker\n\n**CLI parameter:** `--dry-sequence-breaker STRING`\n\n**Request parameter**: `dry_sequence_breakers` (passed as a string array in the JSON payload).\n\n**Values:** Use \"none\" to not use any sequence breakers\n\n**Default:** `\\n`, `:`, `\"`, `*`\n\nA sequence breaker defines tokens/strings that reset the DRY tracker. This is essential for structured generation. For example, you want the model to be allowed to repeat structural characters (like Markdown table pipes |), but not the text itself.\n\n### 4.5 DRY Penalty last N\n\n**CLI parameter:** `--dry-penalty-last-n N`\n\n**Request parameter:** `dry_penalty_last_n`\n\n**Values:** 0 = disable, -1 = context size, or any integer up to the context length\n\n**Default:** -1\n\nThis parameter defines the size of the look-back window (in tokens) that the DRY (Don’t Repeat Yourself) sampler analyzes to detect sequence repetitions. 
In practice, this is the \"memory depth\" for the DRY system.\n\n**Usage:** Although it’s not very common to set this value, you may want to set the look-back window to be large enough to catch loops that span a few lines of output (for structured tasks like JSON or coding, for example).\n\n**DRY Usage:**\n\n**Coding / JSON**: Ensures the model doesn’t loop boilerplate code or output empty brackets like `}{}{}{}`: `--dry-multiplier 0.8` `--dry-allowed-length 2`\n\n**RAG / Fact Checking**: Use a moderate multiplier to stop the model from repeating injected context verbatim: `--dry-multiplier 0.5`\n\n## 5. Exclude Top Choices (XTC)\n\nXTC is an **intervention-based** sampler that’s helpful for sub 14B models or heavily quantized ones (e.g. Q2).\n\nWhen the model is stuck in a loop, it usually assigns extremely high probability to the same few tokens.\n\nWith probability `xtc_probability`, XTC looks at the tokens whose probability exceeds `xtc_threshold` and, if there are at least two of them, removes all of them except the least likely one. This forces the model to sample from a less obvious but still viable choice.\n\nIt ignores the model’s confidence level and instead randomly injects “chaos” into the top-tier token pool.\n\nUnlike `repeat_penalty` (which might punish a word even when logically required), XTC only acts when more than one token clears the threshold. When there is a single viable token, such as a required structural word, it is left alone, which keeps the underlying grammar intact.\n\n### 5.1 XTC Probability\n\n**CLI parameters:** `--xtc-probability N`\n\n**Request parameter**: `xtc_probability`\n\n**Range:** `0` to `1` (`0` = disabled)\n\n**Default:** `0.0`\n\nThe chance, at each sampling step, that XTC is applied at all.\n\n### 5.2 XTC Threshold\n\n**CLI parameters:** `--xtc-threshold N`\n\n**Request parameter**: `xtc_threshold`\n\n**Range:** `0` to `1` (`1` = disabled; any value above `0.5` has the same effect, since two tokens cannot both exceed it)\n\n**Default:** `0.10`\n\nThe minimum probability a token needs in order to count as a “top choice”.\n\n**Setting XTC**:\n\nFor creative tasks, use this when the model produces “stuttering” or repetitive narrative structures:\n\n`--xtc-probability 0.5`\n\n`--xtc-threshold 0.1`\n\nFor strict coding or math, disable XTC:\n\n`--xtc-probability 0`\n\n`--xtc-threshold 1`\n\n## 6. 
Dynamic Temperature\n\nAdjusts temperature dynamically based on the entropy of the token distribution. If the model is uncertain (flat distribution, high entropy), it raises the temperature toward the top of the window. If the model is confident (spiky distribution, low entropy), it lowers the temperature toward the bottom of the window.\n\nThe dynamic temperature algorithm defines a strict numerical window and then uses a power curve to slide the actual applied temperature up and down within that window based on the model's entropy.\n\n### 6.1 Dynatemp Temperature Range\n\n**CLI parameter:** `--dynatemp-range N`\n\n**Default**: `0.0` (disabled)\n\nThis parameter defines the maximum and minimum limits of the temperature swing, centered around your base temperature. The final temperature will be:\n\nFrom: `temperature - dynatemp_range` (never below `0`)\n\nTo: `temperature + dynatemp_range`\n\n**Example:** If you set `--temperature 1.0` and `--dynatemp-range 0.2`, the sampler only ever applies temperatures between `0.8` and `1.2`.\n\n### 6.2 Dynatemp Temperature exponent\n\n**CLI Args**: `--dynatemp-exp E`\n\n**Default:** 1.00\n\nWhile the range sets the floor and ceiling, the exponent dictates *how* the sampler travels between those two extremes.\n\n`H` (normalized entropy): The algorithm calculates the entropy of the candidate distribution and divides it by the maximum possible entropy, which gives a score between `0.0` (absolutely certain) and `1.0` (totally confused).\n\n`E` (your config): The exponent E is applied to this score before mapping it to the temperature window.\n\nThe underlying math conceptually looks like this: `T = T_min + (T_max - T_min) * H^E`\n\nBy changing the exponent, you change the curve of the interpolation:\n\n`E = 1.0` **(Linear):** The temperature scales proportionately with normalized entropy. A score of `0.5` yields a temperature exactly in the middle of your range.\n\n`E > 1.0` **(Conservative/Convex):** For example, if `E = 2.0`, squaring a `0.5` score yields `0.25`. 
This heavily biases the output toward the **lower** end of your temperature range. The model will stay cool and focused most of the time, only approaching the maximum temperature when the distribution is close to uniform.\n\n`E < 1.0` **(Aggressive/Concave):** For example, if `E = 0.5` (a square root curve), taking the square root of `0.5` yields `~0.7`. This biases the output toward the **higher** end of your temperature range. The model will run “hot” by default and only clamp down to the minimum temperature when it is nearly certain of the next token.\n\n**Usage**: This is an excellent set-and-forget alternative to static temperature for mixed-use chat environments.\n\nA standard robust configuration sets a relatively wide range with a high exponent (e.g., `--temp 1.0`, `--dynatemp-range 0.4`, `--dynatemp-exp 2.0`).\n\nThis keeps the model operating near `0.6` whenever it is confident, which protects logical consistency, and lets it climb toward `1.4` only when many tokens are about equally plausible. Note that a near-zero entropy state, such as a repetition loop, gets the *lowest* temperature of the window, so Dynatemp on its own does not break loops; use DRY or XTC for that.\n\n## 7. Adaptive-p\n\nAdaptive-P is a stateful, dynamic alternative to standard Top-P (Nucleus) sampling. While a static Top-P uses a fixed cumulative probability threshold (e.g., 0.95) for every single step, Adaptive-P continuously shifts that threshold based on how confident the model has been over the last few tokens. Instead of rigidly cutting off the token pool at a fixed percentage, Adaptive-P tracks the actual probability of the tokens the model ends up selecting. It uses an Exponential Moving Average to maintain a \"running state\" of the model's confidence.\n\nThe adaptive-p sampler transforms the token probability distribution to favor tokens that fall near a user-configurable probability target.\n\nInternally, the sampler maintains an exponential moving average of the *original* probabilities of selected tokens. 
It uses this, along with the user’s set target, to compute an *adapted* target at each sampling step, steering the running average toward the configured target over time.\n\nIf recent selections have been higher-probability than target, the sampler compensates by temporarily favoring lower-probability tokens, and vice versa [(more info on the PR #17927)](https://github.com/ggml-org/llama.cpp/pull/17927).\n\n⚠️ Adaptive-p selects a token ID rather than just mutating candidates, so it must be last in the `--samplers` chain.\n\n### 7.1 Adaptive Target\n\n**CLI parameter:** `--adaptive-target N`\n\n**Range:** `0.0` to `1.0` (negative value = disabled)\n\n**Default:** -1.00\n\nThis establishes the baseline probability mass you want to capture (similar to a standard Top-P value).\n\nWhen set to a negative number, the adaptive probability transform is **disabled**, and instead it just samples normally.\n\nA good starting point is `0.55`. Then you can raise or lower the target in increments of `0.05` as you experiment.\n\nDuring generation, if the model is outputting highly predictable text (like boilerplate code), it consistently selects tokens with high probability. Adaptive-P detects this streak and shrinks the sampling threshold, effectively behaving like a very strict Top-P or greedy sampler. This prevents low-probability garbage tokens from slipping in.\n\nIf the model then encounters a complex reasoning step and the probability distribution flattens (uncertainty), the running average drops. 
Adaptive-P instantly widens the threshold, allowing the model to evaluate a larger, more diverse pool of tokens until its confidence stabilizes again.\n\nIn practice, it accomplishes a similar goal to Mirostat (adapting to the model’s entropy), but it achieves it by directly manipulating the Top-P cumulative mass boundary rather than targeting cross-entropy.\n\n### 7.2 Adaptive Decay Rate\n\n**CLI parameter:** `--adaptive-decay N`\n\n**Range:** 0.0 to 0.99 (clamped to <=0.99 at init to avoid unbounded accumulation)\n\n**Default:** 0.90\n\nThis is the smoothing factor. It dictates how much “momentum” the running average has.\n\nA **higher value** (e.g., 0.95) means high inertia; the sampling threshold adapts slowly to changes in the model’s confidence.\n\nA **lower value** (e.g., 0.50) makes the sampler highly reactive, immediately widening or narrowing the token pool if the model suddenly gets confused or highly confident.\n\n## 8. Mirostat\n\nStandard samplers like Top-P, Min-P, and Top-K are stateless functions. They apply a hardcoded mathematical filter to the logit array on every single token generation, completely blind to the context of what happened in the previous step.\n\nMirostat is a stateful algorithm. It maintains a running metric of the text’s “surprise” (cross-entropy) and dynamically adjusts the truncation boundary for the *next* token based on the mathematical outcome of the *previous* token.\n\nInstead of defining a fixed probability cutoff, you define a target level of randomness. The algorithm then continuously shifts the bounds to maintain that exact level.\n\n⚠️ Mirostat selects a token ID rather than just mutating candidates, so it must be last in the `--samplers` chain. 
Using Mirostat disables the Top-K, Top-P, and Locally Typical samplers.\n\n### 8.1 Mirostat\n\n**CLI parameter:** `--mirostat N`\n\n**Values:** 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0\n\n**Default:** 0\n\n### 8.2 Mirostat Learning rate\n\n**CLI parameter:** `--mirostat-lr N`\n\n**Default:** 0.1\n\nThe learning rate dictates the step size for adjusting the internal filter.\n\nA high value means the algorithm reacts very quickly to sudden changes in the model’s confidence, but it can overshoot and cause jitter.\n\nA low value provides a smoother, more gradual adjustment over multiple tokens.\n\n### 8.3 Mirostat Target entropy\n\n**CLI parameter:** `--mirostat-ent N`\n\n**Default:** 5.00\n\n**Usage**: Useful for running highly quantized models where quantization introduces severe perplexity spikes.\n\nThis is your desired baseline of randomness.\n\nA low value (e.g., 3.0) forces the algorithm to aggressively prune tokens to keep the text highly predictable and safe.\n\nA higher value (e.g., 8.0) loosens the bounds, allowing a wider variety of vocabulary and structure.\n\nThe default target entropy of `5.0` keeps the output stable.\n\n## 9. Honorable mentions\n\nThese are not exactly sampling controls but help mitigate some failure modes:\n\n`--seed 1234`: I usually pass a seed to make different server runs a bit more reproducible. The actual value doesn’t matter as long as you’re consistent.\n\n`--n-predict 2048`: I usually set a cap on how many tokens are generated. That way if the model is stuck in a loop, I don’t have to wait for the entire context length. Fail fast. I usually set the initialization value to something high because it can also be set per request using `max_completion_tokens` ([ref](https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create#(resource)%20chat.completions%20%3E%20(method)%20create%20%3E%20(params)%200.non_streaming%20%3E%20(param)%20max_completion_tokens%20%3E%20(schema))). 
That way, I can lower it per-request depending on what I’m expecting. One way to look at it is time: if your server emits on average `20 tok/sec`, then a max value of `2400` means the server can go for `120` seconds (2 minutes). I think that’s reasonable if it doesn’t happen too often.\n\n`--reasoning-budget 1024`: sometimes the model gets stuck overthinking. By manually setting a thinking budget in tokens, I prevent that. In my experience with Gemma 4 26B, around 1024 tokens is more than enough and usually the model stops before hitting this limit. But if it gets stuck, I don’t want to sit there and wait.\n\n`--json-schema-file`: super useful for when the response should be JSON and you don’t want to waste time and tokens by doing the schema validation outside the model (e.g. in the harness).\n\n`--grammar` / `--grammar-file`: allows enforcing rigid structural boundaries at the sampling level using a [BNF-like grammar](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) to constrain generations ([examples](https://github.com/ggml-org/llama.cpp/tree/master/grammars)). I don’t use them because I haven’t needed them but it’s worth knowing that if you want a strict output, you can enforce it at the server level.\n\n# Conclusion\n\nIf the neural network is the “brain”, sampling acts as the hormones that control the action. Unfortunately most UIs don’t give much control over these parameters.\n\nSLMs and quantized models can severely suffer from repetition and other failure modes we mentioned at the start of the article.\n\nThe default Temperature, TopP, and even MinP go only so far.\n\nIf you want to run local models professionally, you need to stay on top of sampling or at least be able to reason about it.\n\nLlama.cpp has evolved a lot and as a result, some sampling mechanisms aren’t recommended (e.g. `repeat_penalty`) while some of the newer ones (e.g. 
XTC, Mirostat) boost the *emergent* behavior due to sheer complexity.\n\nA given model, quantization and workload requires some trial and error to find the right sampling algorithms and parameters. In this article we mentioned some of those tips & tricks to shortcut your iterations.\n\n# References\n\nllama-server\n\n[Sampling parameters](https://openrouter.ai/docs/api/reference/parameters#min-p)\n\nWhat is temperature, TopP and TopK ([YouTube](https://www.youtube.com/watch?v=jnikMver_CE))\n\nDRY sampler [PR #9702](https://github.com/ggml-org/llama.cpp/pull/9702)\n\nXTC sampler [PR #9742](https://github.com/ggml-org/llama.cpp/pull/9742)\n\nAdaptive-P sampler [PR #17927](https://github.com/ggml-org/llama.cpp/pull/17927)\n\nLLM - XTC is The Secret Sauce for RPG, Creative Writing and others ([YouTube](https://www.youtube.com/watch?v=rgyE4aMxFDo))", "url": "https://wpnews.pro/news/sampling-args-in-llama-server", "canonical_source": "https://blog.alexewerlof.com/p/sampling-args-in-llama-server", "published_at": "2026-07-01 18:28:35+00:00", "updated_at": "2026-07-03 21:01:59.904326+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "developer-tools"], "entities": ["llama.cpp", "LM Studio", "Jan", "Ollama", "Gemini", "Qwen", "VS Code REST Client"], "alternates": {"html": "https://wpnews.pro/news/sampling-args-in-llama-server", "markdown": "https://wpnews.pro/news/sampling-args-in-llama-server.md", "text": "https://wpnews.pro/news/sampling-args-in-llama-server.txt", "jsonld": "https://wpnews.pro/news/sampling-args-in-llama-server.jsonld"}}