Sampling args in llama-server

Llama.cpp users can significantly improve inference speed and output quality by tuning sampling parameters such as temperature, TopP, MinP, TopK, repeat penalty, DRY, XTC, Dynatemp, Adaptive-P, and Mirostat, which mitigate common failure modes like probability collapse, hallucination, grammar degradation, and quantization noise. The article provides a reference for these parameters, their ranges, defaults, and how to adjust them per request using the OpenAI-compatible API.

Reducing repetition, hallucinations, and degradation while making inference faster

llama.cpp is the most popular runtime for open-weight LLMs. Most beginners, including myself, start with LM Studio, Jan, or Ollama, but once you have a grasp of the basics, you get much more control over the model runtime by using llama.cpp directly. The difference is night and day. The same model may go from 10 tok/sec to 20 tok/sec when you tweak sampling. However, it’s not just about speed. These parameters impact benchmark results and eval effectiveness, yet they’re mostly underutilized.

This article is a reference for:

- Common failure modes for local (especially quantized) models
- llama.cpp sampling parameters: what they do, their valid range, default value, and how to adjust them based on your workload (e.g. creative writing, LLM as a judge, deterministic code generation, etc.)

We’ll discuss:

- Common params: Temperature, TopP, MinP, TopK, repeat penalty
- DRY
- XTC
- Dynatemp
- Adaptive-P
- Mirostat
- Why the older and more common switches and knobs (temperature, TopK, TopP) are not adequate, and what the modern alternatives are
- Some tips and tricks to accelerate your experimentation loop

Note: I used Gemini 3.1 Pro Extended Thinking model in the preliminary research stage but I’ve gone through everything and heavily edited it to bear my personal name on it. All errors are mine.

Failure modes

By explicitly setting sampling and repetition switches, we can mitigate several common failure modes:

- Probability Collapse (The Infinite Loop): The model becomes overly confident in a specific sequence (e.g., Markdown table formatting, empty JSON brackets) and gets stuck in an unrecoverable repetition loop.
- Hallucination and Syntax Breakage: Excessive unconstrained randomness (high entropy) causes the model to generate factually incorrect statements, break structured formats, or output grammatical gibberish.
- Grammar Degradation: Older, blunt token penalties blindly punish essential structural words (“the”, “a”, “{”, punctuation, etc.) simply because they appear frequently, destroying sentence coherence over long context windows.
- Quantization Noise (Perplexity Spikes): Local quantization introduces statistical artifacts into the logit distribution that static samplers struggle to handle smoothly, leading to unpredictable drops in generation quality.

Note: perplexity is a statistical metric that measures how “confused” or “surprised” a model is by the actual next word in a sentence. A low perplexity score means the model assigned a higher probability to the correct words, indicating a better understanding of the language and context.

Startup vs. Runtime Configuration

The configurations for llama.cpp can be grouped into 2 categories:

- Immutable: set when you start the app and cannot be changed per request. For example: --ctx-size.
- Mutable per request: can be set at start time, but a request payload compliant with the OpenAI API can change them. For example, a request that carries a temperature value in its payload overrides what you specified with the --temperature CLI argument when starting llama-server.

Fortunately, most sampling parameters can be set per request, so you can experiment and iterate quickly using something like VS Code REST Client (https://marketplace.visualstudio.com/items?itemName=humao.rest-client). Here’s an example config you can modify:

    @host = http://localhost:8080
    @model = qwen-3.6-35B-A3B-MTP-UD

    ### Get the properties and their values
    # To make POST request to change global properties, you need to start server with --props
    GET {{host}}/props
    Content-Type: application/json

    ### Send a simple request
    POST {{host}}/v1/chat/completions
    Content-Type: application/json

    {
      // Temperature controls the randomness of the output. Lower values make the output more deterministic.
      "temperature": 0.1,
      // Setting TopP
      "top_p": 0.75,
      // Maximum output tokens
      "max_completion_tokens": 1024,
      // penalize new tokens based on whether they appear in the text so far
      "presence_penalty": 2,
      // penalize new tokens based on their existing frequency in the text so far.
      "frequency_penalty": 2,
      // Exclude Top Choices (XTC)
      "xtc_probability": 0.5,
      "xtc_threshold": 0.1,
      "model": "{{model}}",
      "messages": [
        { "role": "system", "content": "You are a masterful toddler short-form storyteller." },
        { "role": "user", "content": "Tell me a short story about a duck that couldn't fly." }
      ],
      "stream": false,
      "return_progress": true,
      "reasoning_format": "auto",
      "chat_template_kwargs": { "enable_thinking": false },
      "reasoning_control": true,
      "backend_sampling": false,
      "timings_per_token": true
    }

    ### Send a simple request
    POST {{host}}/v1/chat/completions
    Content-Type: application/json

    {
      // Temperature controls the randomness of the output. Lower values make the output more deterministic.
      "temperature": 0.1,
      // Setting TopP
      "top_p": 0.75,
      // Maximum output tokens
      "max_completion_tokens": 1024,
      // penalize new tokens based on whether they appear in the text so far
      "presence_penalty": 2,
      // penalize new tokens based on their existing frequency in the text so far.
      "frequency_penalty": 2,
      // Exclude Top Choices (XTC)
      "xtc_probability": 0.5,
      "xtc_threshold": 0.1,
      "model": "{{model}}",
      "messages": [
        { "role": "system", "content": "Your task is to finish the user's sentence with exactly one word." },
        { "role": "user", "content": "United States of" }
      ],
      "stream": false,
      "return_progress": true,
      "reasoning_format": "auto",
      "chat_template_kwargs": { "enable_thinking": false },
      "reasoning_control": true,
      "backend_sampling": false,
      "timings_per_token": true
    }

Since there’s no harness in between, you can quickly iterate through different parameter values.

Another tip is to use Gemini 3.1 Pro extended thinking to understand and set different values.
Just make sure to give it ample information about your hardware and runtime environment to get good help. Always check the response against the official documentation.

Another tip is to write your command in a shell script and keep it open in an editor between tweak-run cycles. Here’s an example:

    #!/usr/bin/env bash
    llama-server \
      --hf-repo unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
      --alias qwen-3.6-35B-A3B-MTP-UD \
      --threads 8 \
      --threads-batch 8 \
      --parallel 2 \
      --kv-unified \
      --batch-size 2048 \
      --ubatch-size 512 \
      --ctx-size 131072 \
      --n-predict 8192 \
      --reasoning-budget 1024 \
      --cache-ram 8192 \
      --n-gpu-layers all \
      --jinja \
      --cont-batching \
      --flash-attn on \
      --temp 1.0 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.00 \
      --samplers "top_k;top_p;min_p;temperature;typ_p" \
      --image-min-tokens 1024 \
      --presence-penalty 1.5 \
      --spec-type draft-mtp \
      --spec-draft-n-max 2 \
      --mmap \
      --metrics \
      --log-colors on \
      --log-verbosity 3 \
      --log-prompts-dir ./prompt-logs \
      --log-file llama-cpp.log \
      --host 0.0.0.0 \
      --port 8080

    # --parallel should be at least 2 to prevent /metrics requests from being cancelled.
    # Heads-up: the --samplers list above omits "penalties", so --presence-penalty
    # is likely ignored unless you add it (see "Activation via Inclusion" below).
    # W srv load model: cache reuse is not supported by this context, it will be disabled
    # --cache-reuse 256 \
    # Disabled for speed
    # --cache-type-k q8_0 \
    # --cache-type-v q4_0 \

1. The Execution Pipeline

Before tuning individual parameters, it is critical to understand how the sampling execution graph is constructed.

CLI parameter: --samplers SAMPLERS
Mechanic: Defines the exact sequence of algorithms (semicolon separated) that is applied to the raw logits.
Default: penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature

Unless you are explicitly tuning for a specific mathematical outcome or optimizing CPU overhead, leave this parameter unset to use the default execution order.

Critical Nuances:

- Activation via Inclusion: Setting a command-line argument (e.g., --top-p 0.5) merely configures an internal state variable.
If top_p is not included in the --samplers list (by default it is), then it doesn’t have any effect.
- Mathematical Precedence: The order matters. For example, the default pipeline applies penalties and dry before top_p. This ensures raw scores are penalized first, allowing truncation samplers to correctly drop heavily penalized tokens. Reversing this order could result in truncating the pool down to 10 tokens, penalizing 8 of them, and forcing the model to choose from 2 terrible remaining options.
- Compute Optimization: Sampler order impacts CPU overhead. By placing a rigid truncation sampler like top_k early in the sequence, you drop thousands of long-tail logits from the candidate array. Subsequent, computationally expensive samplers like XTC or DRY will then execute much faster because they only iterate over a small array of tokens (e.g., 40 instead of the model’s entire 128,000+ vocabulary).

Note: A token is the actual building block of text (a word or sub-word piece) the AI uses, whereas a logit is a raw, unnormalized numerical score that the model assigns to a given token in its vocabulary to determine which one comes next.

2. Basic Probability Shaping

These arguments modify the raw probability distribution of the next token before it is sampled. They define the vocabulary pool the model is allowed to draw from.

2.1 Temperature

CLI parameter: --temperature N
Request parameter: temperature (ref: https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create)
Range: 0.0 (greedy) to 2.0+ (creative). Note: although technically it’s possible to go above 2.0, it hurts the quality and usually leads to nonsensical output.
Default: 0.80 (llama-server’s default)

Divides the raw logits by N before applying the softmax function.

- N < 1.0: sharpens the distribution (deterministic/greedy).
- N > 1.0: flattens the distribution (increases variance).

Note: The softmax function converts raw, unnormalized prediction scores (logits) into a proper probability distribution. It guarantees all token probabilities fall between 0 and 1 and sum up to exactly 1.

Setting temperature:

- RAG/Coding/Fact-checking: 0.0 to 0.3 (prioritize strict syntax and grounded facts).
- Review/Judge: 0.4 to 0.7 (needs coherence but flexibility in reasoning).
- Creative Writing: 0.8 to 1.2+ (requires strict bounds like Min-P to prevent gibberish at higher values).

2.2 Top-P

CLI parameter: --top-p N
Request parameter: top_p (ref: https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create)
Range: 0.0 to 1.0 (1.0 = disabled).
Default: 0.95

Top-P (also known as Nucleus Sampling) sorts tokens by probability, then retains the smallest set of top tokens whose cumulative probability reaches N. This creates dynamic truncation. If the model is highly confident, the token pool is small. If uncertain, the pool is wide. A Top-P of 0 is essentially greedy sampling, which selects the highest-probability token. A Top-P of 1 keeps the whole vocabulary, so it is essentially unrestricted random sampling from the model’s distribution.

Setting Top-P:

- Creative writing: 0.80 to 0.95
- Coding/RAG: 0.1 to 0.5

2.3 Min-P

CLI parameter: --min-p N
Request parameter: min_p (I could not find a reference to that in the OpenAI API but OpenRouter has it: https://openrouter.ai/docs/api/reference/parameters#min-p)
Values: 0.0 (disabled) to 1.0
Default: 0.05

Truncates any token whose probability is less than N times the probability of the most likely token. For example, if the most likely token has a probability of 0.93, a Min-P of 0.05 removes any token which has a probability less than 0.05 x 0.93 = 0.0465.

Min-P is highly effective for smaller or heavily quantized models.
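To make the three knobs above concrete, here is a minimal Python sketch of temperature scaling, Top-P, and Min-P on a made-up five-token vocabulary. It is an illustration of the definitions in this section, not llama.cpp's actual code; the function names and the logit values are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize into probabilities.

    temperature must be > 0 here; samplers treat 0 as plain greedy argmax.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_keep(probs, p):
    """Indices of the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def min_p_keep(probs, min_p):
    """Indices of tokens whose probability is at least min_p times the top probability."""
    cutoff = min_p * max(probs)
    return [i for i, pr in enumerate(probs) if pr >= cutoff]

logits = [5.0, 3.0, 2.0, 0.5, -1.0]  # made-up scores, most likely token first

# Lower temperature sharpens the distribution, higher temperature flattens it.
for t in (0.5, 1.0, 1.5):
    print(t, [round(pr, 3) for pr in softmax(logits, t)])

probs = softmax(logits, 1.0)
print("top-p 0.95 keeps tokens", top_p_keep(probs, 0.95))
print("min-p 0.05 keeps tokens", min_p_keep(probs, 0.05))
```

Note how the two truncation rules disagree on the same distribution: Top-P keeps adding tokens until the cumulative mass is reached, while Min-P only asks how each token compares to the best one.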
Min-P dynamically scales the truncation threshold based on the model’s confidence, preventing “garbage” tokens without the hard cumulative limit of Top-P.

Setting Min-P: 0.05 to 0.1 across almost all use cases (the default value of 0.05 is pretty good in my experience). It allows for a high temperature (1.5+) in creative writing while maintaining grammatical coherence.

2.4 Top-K

CLI parameter: --top-k N
Request parameter: top_k
Values: 0 (disabled) all the way to the size of the vocabulary
Default: 40

Top-K sorts tokens by probability and discards all but the top N tokens. This is more rigid than Top-P, which dynamically chooses the top possibilities to reach a specific sum.

⚠️ Top-K is largely considered legacy. Use Top-P and Min-P instead. If enabled, keep it relatively high (40 to 100) to avoid artificially constraining the model into loops.

3. Traditional Token-Level Penalties

These parameters apply mathematical reductions to a token’s logit based on its prior appearance in the context window. They are blunt instruments that can degrade grammar if overused.

3.1 Penalty window

CLI parameter: --repeat-last-n N
Request parameter: repeat_last_n (not to be confused with n_keep, which in llama-server controls how many prompt tokens are retained when the context overflows)
Values: 0 (disabled), -1 (entire context), or a positive integer.
Default: 64

This parameter defines the look-back window (in tokens) for all token-level penalties.

Usage:

- Short structured outputs: use -1 to look at the entire context.
- Long-form creative writing: use 256 to 1024 so the model can eventually reuse vocabulary.

3.2 Presence penalty

CLI parameter: --presence-penalty N
Request parameter: presence_penalty (ref: https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create)
Values: -2.0 to 2.0
Default: 0.00 (disabled)
This parameter encourages the introduction of new topics/vocabulary without punishing a word heavily for multiple uses. It subtracts a flat value N from a token’s logit if the token has appeared at least once.

- N > 0 penalizes new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics.
- ⚠️ N < 0 boosts repetition, which may lead to loops or padding the end of the output with repetitive characters. I cannot think of a valid example but if you do, please let me know in the comments.

Usage: 0.1 to 0.4 for brainstorming, creative writing, or editorial work.

3.3 Frequency penalty

CLI parameter: --frequency-penalty N
Request parameter: frequency_penalty (ref: https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create)
Values: -2.0 to 2.0
Default: 0.00 (disabled)

Punishes repetitive verbal tics. The more a token is used, the harder it is penalized.

- N > 0 penalizes new tokens based on their existing frequency in the text so far. It subtracts N x (token count) from the logit.
- ⚠️ N < 0 boosts repetition, which may lead to loops. Again, please let me know in the comments if you can think of a use case for negative values.

Usage: 0.1 to 0.3 to gently suppress the overuse of specific adjectives or transition words.

3.4 Repeat Penalty

CLI parameter: --repeat-penalty N
Request parameter: repeat_penalty
Values: 1.0 (disabled) to 1.2+
Default: 1.00

It divides the logit of previously generated tokens by N (negative logits are multiplied by N instead, so the token always becomes less likely).

⛔ It’s generally recommended to disable it in favor of DRY or Presence/Frequency penalties, because repeat penalty aggressively suppresses structural words (“the”, “a”, punctuation) and easily breaks syntax.

4. DRY (“Don’t Repeat Yourself”)

DRY sampling is a more modern sequence control. It evaluates sequences of tokens rather than isolated tokens.
This prevents catastrophic loops (like repeating Markdown tables) without degrading single-word grammar.

4.1 DRY Multiplier

CLI parameter: --dry-multiplier N
Request parameter: dry_multiplier
Values: 0.0 (disabled) to 1.0
Default: 0.0

Master weight for the DRY sampling algorithm.

4.2 DRY Allowed Length

CLI parameter: --dry-allowed-length N
Request parameter: dry_allowed_length
Values: Positive integer
Default: 2

The sequence length threshold before penalties apply. This allows a certain degree of repetition.

4.3 DRY Base

CLI parameter: --dry-base N
Request parameter: dry_base
Values: Float > 1.0
Default: 1.75

Exponential scaling factor once a sequence exceeds the allowed length: the penalty grows as multiplier x base^(match length - allowed length).

4.4 DRY Sequence Breaker

CLI parameter: --dry-sequence-breaker STRING
Request parameter: dry_sequence_breakers (passed as a string array in the JSON payload)
Values: Use "none" to not use any sequence breakers
Default: '\n', ':', '"', '*'

Sequence breakers define the tokens/strings that reset the DRY tracker. This is essential for structured generation. For example, you want the model to be allowed to repeat structural characters like Markdown table pipes (|), but not the text itself.

4.5 DRY Penalty last N

CLI parameter: --dry-penalty-last-n N
Request parameter: dry_penalty_last_n
Values: 0 = disable, -1 = context size, or any integer up to the context length
Default: -1

This parameter defines the size of the look-back window (in tokens) that the DRY sampler analyzes to detect sequence repetitions. In practice, this is the "memory depth" of the DRY system.

Usage: Although it’s not very common to set this value, you may want the look-back window to be large enough to catch loops that span a few lines of output (for structured tasks like JSON or coding, for example).

DRY Usage:

- Coding / JSON: Ensures the model doesn’t loop boilerplate code or output empty brackets like }{}{}{} (--dry-multiplier 0.8 --dry-allowed-length 2).
- RAG / Fact Checking: Use a moderate multiplier to stop the model from repeating injected context verbatim (--dry-multiplier 0.5).

5. Exclude Top Choices (XTC)

XTC is an intervention-based sampler that’s helpful for sub-14B models or heavily quantized ones (e.g. Q2). When the model is stuck in a rut, it usually assigns high probability to the same few tokens. XTC looks for these high-probability “top choices”: with probability xtc_probability, whenever two or more tokens clear xtc_threshold, it removes all of them except the least likely one. This forces the model to sample a less obvious but still viable choice. It deliberately overrides the model’s preference and randomly injects “chaos” into the top-tier token pool. Unlike repeat penalty (which might punish a word even when logically required), XTC only acts when more than one token clears the threshold, so a token that is the only viable option (as is often the case for grammar and syntax) is left intact.

5.1 XTC Probability

CLI parameter: --xtc-probability N
Request parameter: xtc_probability
Range: 0 to 1 (0 = disabled)
Default: 0.0

5.2 XTC Threshold

CLI parameter: --xtc-threshold N
Request parameter: xtc_threshold
Range: 0 to 1 (1 = disabled)
Default: 0.10

Setting XTC:

- For creative tasks, use this when the model produces “stuttering” or repetitive narrative structures: --xtc-probability 0.5 --xtc-threshold 0.1
- For strict coding or math, disable XTC: --xtc-probability 0 --xtc-threshold 1

6. Dynamic Temperature

Adjusts temperature dynamically based on the entropy of the token distribution. In llama.cpp’s implementation, when the model is confident (spiky, low-entropy distribution), the temperature is lowered to keep it focused. When the model is uncertain (flat, high-entropy distribution), many continuations are plausible, so the temperature is raised to add variance. The dynamic temperature algorithm defines a strict numerical window and then uses a power curve to slide the actual applied temperature up and down within that window based on the entropy.

6.1 Dynatemp Temperature Range

CLI parameter: --dynatemp-range N
Default: 0.0 (disabled)

This parameter defines the absolute maximum and minimum limits of the temperature swing, centered around your base temperature.

The final temperature will be:

- From: temperature - dynatemp_range (floored at 0)
- To: temperature + dynatemp_range

Example: If you set --temperature 1.0 and --dynatemp-range 0.2, the sampler only ever applies temperatures between 0.8 and 1.2.

6.2 Dynatemp Temperature exponent

CLI parameter: --dynatemp-exp E
Default: 1.00

While the range sets the floor and ceiling, the exponent dictates how the sampler travels between those two extremes.

- H (normalized entropy): The algorithm calculates the entropy of the distribution and divides it by the maximum possible entropy, which gives a score between 0.0 (absolutely certain) and 1.0 (totally uncertain).
- E (your config): The exponent E is applied to this score before mapping it to the temperature window.

The underlying math conceptually looks like this:

    T = T_min + (T_max - T_min) x H^E

where T_min = temperature - dynatemp_range and T_max = temperature + dynatemp_range.

By changing the exponent, you change the curve of the interpolation:

- E = 1.0 (Linear): The temperature scales proportionately with the normalized entropy. A score of 0.5 yields a temperature exactly in the middle of your range.
- E > 1.0 (Conservative/Convex): For example, if E = 2.0, squaring a 0.5 score yields 0.25. This heavily biases the output toward the lower end of your temperature range. The model will stay cool and focused most of the time, only reaching the maximum temperature when the distribution is almost completely flat.
- E < 1.0 (Aggressive/Concave): For example, if E = 0.5 (a square root curve), taking the square root of 0.5 yields ~0.71. This biases the output toward the higher end of your temperature range. The model will run “hot” by default and only clamp down to the minimum temperature when it is almost completely certain.

Usage: This is an excellent set-and-forget alternative to static temperature for mixed-use chat environments. A standard robust configuration sets a relatively wide range with a high exponent, e.g., --temp 1.0, --dynatemp-range 0.4, --dynatemp-exp 2.0.
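The mapping from range and exponent to an applied temperature can be sketched in a few lines of Python. Treat it as an illustration under stated assumptions rather than llama.cpp's source: the function names are hypothetical, the two example distributions are made up, and the score fed into the curve is normalized entropy (0 = certain, 1 = flat), which is how the llama.cpp sampler appears to compute it.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy divided by its maximum, ln(len(probs)); result lies in [0, 1]."""
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))

def dynamic_temperature(score, temp, dyn_range, exponent):
    """Map a normalized score in [0, 1] into [temp - dyn_range, temp + dyn_range]."""
    t_min = max(0.0, temp - dyn_range)  # the window is floored at zero
    t_max = temp + dyn_range
    return t_min + (t_max - t_min) * score ** exponent

confident = [0.97, 0.01, 0.01, 0.01]  # spiky distribution (assumed example)
uncertain = [0.25, 0.25, 0.25, 0.25]  # perfectly flat distribution

for name, probs in (("confident", confident), ("uncertain", uncertain)):
    h = normalized_entropy(probs)
    t = dynamic_temperature(h, temp=1.0, dyn_range=0.4, exponent=2.0)
    print(f"{name}: score={h:.3f} temperature={t:.3f}")
```

With --temp 1.0 and --dynatemp-range 0.4 the result can never leave the 0.6 to 1.4 window; the exponent only changes how quickly the score moves it from one end to the other.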
With that configuration, the model operates near 0.6 whenever it is confident, which protects logical consistency, and climbs toward 1.4 only when the distribution is flat enough that many tokens are genuinely plausible. Note that a repetition loop is a near-zero-entropy state, so Dynatemp by itself will not heat the model out of a loop; use DRY or XTC for that.

7. Adaptive-P

Adaptive-P is a stateful, dynamic alternative to static truncation such as Top-P (Nucleus sampling). While a static Top-P uses a fixed cumulative probability threshold (e.g., 0.95) for every single step, Adaptive-P adjusts its behavior based on how confident the model has been over the last few tokens. Instead of rigidly cutting off the token pool at a fixed percentage, Adaptive-P tracks the actual probability of the tokens the model ends up selecting. It uses an Exponential Moving Average to maintain a "running state" of the model's confidence.

The adaptive-p sampler transforms the token probability distribution to favor tokens that fall near a user-configurable probability target. Internally, the sampler maintains an exponential moving average of the original probabilities of selected tokens. It uses this, along with the user’s set target, to compute an adapted target at each sampling step, steering the running average toward the configured target over time. If recent selections have been higher-probability than target, the sampler compensates by temporarily favoring lower-probability tokens, and vice versa (more info on PR 17927: https://github.com/ggml-org/llama.cpp/pull/17927).

⚠️ Adaptive-p selects a token ID rather than just mutating candidates, so it must be last in the --samplers chain.

7.1 Adaptive Target

CLI parameter: --adaptive-target N
Range: 0.0 to 1.0 (negative value = disabled)
Default: -1.00

This sets the probability that the sampler steers the selected tokens toward. Unlike a Top-P value, it is not a cumulative cutoff. When set to a negative number, the adaptive probability transform is disabled, and the sampler just samples normally. A good starting point is 0.55. Then you can raise or lower the target in increments of 0.05 as you experiment.
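The feedback loop behind the target can be illustrated with a toy Python sketch. This is not the code from the PR: the function name is hypothetical, and the rule used to derive the adapted target (mirror the running average around the configured target, then clamp to 0..1) is an assumption chosen only to show the steering idea. The decay argument is the smoothing factor covered in section 7.2.

```python
def adapted_targets(selected_probs, target=0.55, decay=0.90):
    """Yield (running_average, adapted_target) after each selected token.

    selected_probs: original probabilities of the tokens that were picked.
    """
    weighted_sum = 0.0   # decayed sum of selected-token probabilities
    total_weight = 0.0   # decayed count of observations
    for p in selected_probs:
        weighted_sum = decay * weighted_sum + p
        total_weight = decay * total_weight + 1.0
        average = weighted_sum / total_weight
        # ASSUMED rule: aim past the target on the opposite side of the error.
        adapted = min(1.0, max(0.0, 2.0 * target - average))
        yield average, adapted

# A streak of very predictable picks pushes the adapted target down...
for average, adapted in adapted_targets([0.9, 0.9, 0.9, 0.9]):
    print(f"average={average:.2f} adapted_target={adapted:.2f}")

# ...and a streak of unlikely picks pushes it back up.
for average, adapted in adapted_targets([0.2, 0.2, 0.2, 0.2]):
    print(f"average={average:.2f} adapted_target={adapted:.2f}")
```

The point of the sketch is the direction of the correction: the further recent selections drift above the configured target, the lower the sampler aims on the next step, and vice versa.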
During generation, if the model is outputting highly predictable text (like boilerplate code), it consistently selects tokens with high probability, so the running average climbs above the target. Adaptive-P then compensates by temporarily favoring lower-probability tokens, which pulls the average back down. If the model instead goes through a stretch where it keeps picking low-probability tokens (for example, a flat, uncertain distribution during a complex reasoning step), the running average drops below the target and Adaptive-P leans back toward higher-probability tokens until the average recovers.

In practice, it accomplishes a similar goal to Mirostat (a feedback loop that adapts to how the model has been behaving), but it achieves it by steering the probability of the selected tokens toward a target rather than targeting cross-entropy.

7.2 Adaptive Decay Rate

CLI parameter: --adaptive-decay N
Range: 0.0 to 0.99 (clamped to <= 0.99 at init to avoid unbounded accumulation)
Default: 0.90

This is the smoothing factor. It dictates how much “momentum” the running average has.

- A higher value (e.g., 0.95) means high inertia; the sampler adapts slowly to changes in the model’s confidence.
- A lower value (e.g., 0.50) makes the sampler highly reactive, compensating immediately if the model suddenly gets confused or highly confident.

8. Mirostat

Standard samplers like Top-P, Min-P, and Top-K are stateless functions. They apply a hardcoded mathematical filter to the logit array on every single token generation, completely blind to what happened in the previous step.

Mirostat is a stateful algorithm. It maintains a running metric of the text’s “surprise” (cross-entropy) and dynamically adjusts the truncation boundary for the next token based on the outcome of the previous token. Instead of defining a fixed probability cutoff, you define a target level of randomness.
The algorithm then continuously shifts the bounds to maintain that level.

⚠️ Mirostat selects a token ID rather than just mutating candidates, so it must be last in the --samplers chain. When Mirostat is enabled, the Top-K, Top-P (Nucleus), and Locally Typical samplers are ignored.

8.1 Mirostat

CLI parameter: --mirostat N
Values: 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0
Default: 0

8.2 Mirostat Learning rate

CLI parameter: --mirostat-lr N
Default: 0.1

The learning rate dictates the step size for adjusting the internal filter. A high value means the algorithm reacts very quickly to sudden changes in the model’s confidence, but it can overshoot and cause jitter. A low value provides a smoother, more gradual adjustment over multiple tokens.

8.3 Mirostat Target entropy

CLI parameter: --mirostat-ent N
Default: 5.00

This is your desired baseline of randomness. A low value (e.g., 3.0) forces the algorithm to aggressively prune tokens to keep the text highly predictable and safe. A higher value (e.g., 5.0 to 8.0) loosens the bounds, allowing a wider variety of vocabulary and structure. The default target entropy of 5.0 keeps the output stable.

Usage: Useful for running highly quantized models where quantization introduces severe perplexity spikes.

9. Honorable mentions

These are not exactly sampling controls but help mitigate some failure modes:

- --seed 1234: I usually pass a seed to make different server runs a bit more reproducible. The actual value doesn’t matter as long as you’re consistent.
- --n-predict 2048: I usually set a cap on how many tokens are generated. That way if the model is stuck in a loop, I don’t have to wait for the entire context length. Fail fast.
I usually set the startup value to something high because it can also be set per request using max_completion_tokens (ref: https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create). That way, I can lower it per request depending on what I’m expecting. One way to look at it is time: if your server emits on average 20 tok/sec, then a max value of 2400 means the server can go for 120 seconds (2 minutes). I think that’s reasonable if it doesn’t happen too often.
- --reasoning-budget 1024: Sometimes the model gets stuck overthinking. By manually setting a thinking budget in tokens, I prevent that. In my experience with Gemma 4 26B, around 1024 tokens is more than enough and usually the model stops before hitting this limit. But if it gets stuck, I don’t want to sit there and wait.
- --json-schema-file: Super useful when the response should be JSON and you don’t want to waste time and tokens by doing the schema validation outside the model (e.g. in the harness).
- --grammar/--grammar-file: Allows enforcing rigid structural boundaries at the sampling level using a BNF-like grammar (https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) to constrain generations (examples: https://github.com/ggml-org/llama.cpp/tree/master/grammars). I don’t use them because I haven’t needed them, but it’s worth knowing that if you want a strict output, you can enforce it at the server level.

Conclusion

If the neural network is the “brain”, sampling acts as the hormones that control the action. Unfortunately most UIs don’t give much control over these parameters. SLMs and quantized models can severely suffer from repetition and the other failure modes we mentioned at the start of the article. The default Temperature, TopP, and even MinP go only so far.
If you want to run local models professionally, you need to stay on top of sampling or at least be able to reason about it. llama.cpp has evolved a lot and, as a result, some sampling mechanisms are no longer recommended (e.g. repeat penalty), while others (e.g. XTC, Mirostat) are powerful but complex enough that their effects are hard to predict without experimenting. A given model, quantization, and workload requires some trial and error to find the right sampling algorithms and parameters. In this article we mentioned some tips & tricks to shortcut your iterations.

References

- llama-server Sampling parameters: https://openrouter.ai/docs/api/reference/parameters#min-p
- What is temperature, TopP and TopK (YouTube): https://www.youtube.com/watch?v=jnikMver_CE
- DRY sampler PR 9702: https://github.com/ggml-org/llama.cpp/pull/9702
- XTC sampler PR 9742: https://github.com/ggml-org/llama.cpp/pull/9742
- Adaptive-P sampler PR 17927: https://github.com/ggml-org/llama.cpp/pull/17927
- LLM - XTC is The Secret Sauce for RPG, Creative Writing and others (YouTube): https://www.youtube.com/watch?v=rgyE4aMxFDo

My monetization strategy (https://blog.alexewerlof.com/p/faq#%C2%A7payment) is to give away most content for free, but these posts take anywhere from a few hours to a few days to draft, edit, research, illustrate, and publish. I pull these hours from my private time, vacation days and weekends. The simplest way to support this work is to like, subscribe and share it. If you really want to support me lifting our community, you can consider a paid subscription. If you want to save, you can get 20% off via this link: https://blog.alexewerlof.com/protipsdiscount. As a token of appreciation, subscribers get full access to the Pro-Tips sections and my online book Reliability Engineering Mindset (https://blog.alexewerlof.com/p/rem).
Your contribution also funds my open-source products like Service Level Calculator https://slc.alexewerlof.com/ . You can also invite your friends https://blog.alexewerlof.com/leaderboard to gain free access or save via a group subscription https://blog.alexewerlof.com/subscribe?group=true . And to those of you who already support me, thank you for sponsoring this content for the others. 🙌 If you have questions or feedback, or you want me to dig deeper into something, please let me know in the comments.