Llama.cpp flags auto-tuning tool

wpnews.pro

ggrun

= "gguf run". Formerly llm-server.

Stop hand-writing --tensor-split, -ot, and KV-cache flags. Point ggrun at a GGUF and it measures your GPUs, RAM, and PCIe topology, picks the backend (llama.cpp or the faster ik_llama.cpp fork), computes multi-GPU and MoE expert placement, and serves an OpenAI-compatible API. The placement is exact — read from the GGUF and measured VRAM, not guessed — so big MoE models load across mismatched GPUs that would run out of memory under other launchers.

ggrun recommend                           # rank the best models for YOUR hardware
ggrun unsloth/Qwen3.6-27B-GGUF --download # HF repo → hardware-matched quant → served
ggrun model.gguf                          # local GGUF → served
ggrun                                     # no args → interactive TUI

No flags: ggrun with no arguments opens a TUI that detects your GPUs, lists your models, computes hardware-matched launch settings, and ranks downloads by fit. Pass a model path or flags for one-shot CLI use instead.

Same rig (RTX 3090 Ti 24GB + 4070 12GB + 3060 12GB, 125GB RAM), same GGUFs, 32k context, decode tok/s (256-token generation), slowest backend on the left:

| Model (quant) | Ollama 0.30.8 | llama.cpp --fit | ggrun v3 | v3 --ai-tune | v3 vs Ollama | |---|---|---|---|---|---| | Qwen3.5-4B Q4_K_M | 124.8 | 103.3 | 151.4 | 185.7 | +49% | | Qwen3.6-27B Q5_K_M | 22.8 | 24.3 | 37.4 | 39.8 | +74% | | Qwen3.5-122B-A10B UD-IQ4_XS (MoE) | 13.5† | 21.0 | 22.9 | 23.1 | +71% | | MiniMax-M3 UD-IQ3_XXS (MoE) | ✗ won't load | ✗ won't load | 5.59 | 5.65 | Ollama can't load |

† Ollama can't import sharded GGUFs (ollama#5245), so the 122B was merged to one file before importing; MiniMax-M3 it can't load at all (minimax-m3

is ik_llama-only). Where models load, ggrun is 49–74% faster than Ollama on this rig — including +71% on the 122B MoE at heavy VRAM+RAM offload (60 GB, ~18 GB spilled to RAM). Driving the same llama.cpp master binary (no ik_llama), ggrun still beat raw --fit

, so the gain is the placement, not just the backend swap. Measured with scripts/bench-v3-comparison.sh on the rig above; full method and backend commits are in

docs/launch-performance.md.

Linux / macOS — self-contained app home under ~/ggrun

:

curl -fsSL https://raw.githubusercontent.com/raketenkater/ggrun/main/setup.sh | bash

Windows (PowerShell); add -Backend cuda

for native NVIDIA CUDA:

iwr -useb https://raw.githubusercontent.com/raketenkater/ggrun/main/install.ps1 | iex

From a clone:

git clone https://github.com/raketenkater/ggrun.git && cd ggrun && ./setup.sh

Since v3, prebuilt release bundles (Linux CPU/Vulkan, macOS arm64 Metal, Windows x86_64 CPU) install without compiling, verified against SHA256SUMS

; Linux CUDA/ik_llama.cpp builds from source for your GPU. Run ggrun

with no arguments to open the TUI. Installer options and the app-home layout are in docs/install.md.

ggrun ~/models/model.gguf                 # launch a local model
ggrun unsloth/Qwen3.6-27B-GGUF --download # download a fitting quant, then launch
ggrun model.gguf --ai-tune                # re-benchmark flags as backends change, cache the fastest
ggrun model.gguf --dry-run                # print the backend command without running
ggrun model.gguf --benchmark              # load, measure tok/s, exit

Common flags: --backend ik_llama|llama|vulkan

, --gpus 0,1

, --ctx-size

, --kv-quality

, --kv-placement

, --vram-headroom 2G

, --ram-headroom 8G

, --vision

, --spec auto

. Unknown flags pass straight through to llama-server

. Full list: docs/usage.md.

Security:ggrun binds to127.0.0.1

by default because the OpenAI-compatible API isunauthenticated. To serve trusted LAN clients, opt in with--host 0.0.0.0

(or setLLM_HOST

/ the Host setting in the TUI) and restrict access with your firewall.

vs raw llama.cpp. Upstream --fit

auto-picks GPU layers, tensor-split, and context. If that covers you, raw llama.cpp may be enough. ggrun goes further: it selects the backend (ik_llama.cpp is meaningfully faster on CUDA), picks KV-cache type and batch sizes from measured probes, benchmarks candidate flag sets (--ai-tune

), finds/validates vision projectors and speculative drafts, and recovers from crashes.

vs Ollama. Ollama wins on one-command simplicity and ecosystem on common hardware. ggrun targets where Ollama's conservative heuristics leave performance behind: mismatched multi-GPU rigs, MoE models split across VRAM/RAM, ik_llama.cpp speed, and full flag access. One GPU and want zero config? Use Ollama.

vs llama-swap. llama-swap hot-swaps between model commands you write yourself; ggrun computes those commands. They compose — point llama-swap at ggrun dry-run

output, or use ggrun daemon

for single-model swapping.

Capability	raw llama.cpp	ggrun
Multi-GPU / heterogeneous split	`--fit` (recent)
automatic, PCIe/bandwidth-weighted
MoE expert placement	`--fit` / manual `-ot`
exact per-GPU ledger, backend-aware
Backend selection (ik_llama / llama / Vulkan)	manual	automatic, dialect-aware
KV-cache type / batch sizing	manual	probe-measured
AI Tune (measured flag search)	no	yes, cached per model+hardware
Hardware-matched quant download	no	yes (HF search, capability-vs-fit ranked)
Vision projector / speculative decoding	manual	automatic, validated
Crash recovery / backend fallback	no	yes

One Go binary; Linux, macOS, and native Windows. CUDA / Vulkan / Metal / CPU.
Exact-ledger multi-GPU + MoE expert placement ( --tensor-split

+-ot

from measured VRAM and GGUF sizes), with adaptive retry on out-of-memory. AI Tune— re-benchmarks candidate flag sets as llama.cpp and ik_llama.cpp evolve and caches the fastest valid result per model + hardware, so your launch flags keep up with the backends without hand-tracking upstream. It only changes performance knobs (batch, threads, flash-attn, speculative decoding), never output quality. A community tune pool seeds first launches (LLM_COMMUNITY_TUNES=off

).- Hugging Face down with hardware-aware quant selection and a GUI recommendation picker that ranks models by capability against how well each quant fits your VRAM.

Speculative decoding (MTP, EAGLE-3, validated draft GGUFs) and vision ( mmproj

) support. - OpenAI-compatible server, arrow-key TUI, crash recovery with backend fallback.

ik_llama.cpp (CUDA, source build) · llama.cpp (Vulkan, Metal, CPU) · native Windows CUDA via install.ps1 -Backend cuda

. The backend binary is pluggable via LLAMA_SERVER

. AMD and Intel GPUs run through Vulkan (no ROCm/HIP path). macOS/Metal builds and detects Apple unified memory but is pending validation on real Apple hardware — every benchmark here is from NVIDIA CUDA.

Linux:curl

,git

,python3

;cmake

/compiler + NVIDIA CUDA toolkit for CUDA source builds;vulkaninfo

for Vulkan detection.macOS: Apple Silicon; Xcode command-line tools for source builds.Windows: Windows 10/11 x86_64, PowerShell 5+, Python.-Backend cuda

downloads prebuilt llama.cpp CUDA binaries (no toolchain needed); CUDA Toolkit + VS C++ Build Tools are only required for the from-source fallback.

Install · Usage · Architecture · Benchmarks · Speculative decoding · Model recommendations · Changelog

MIT

source & further reading

github.com — original article

Llama.cpp flags auto-tuning tool

Run your AI side-project on zahid.host