Llama.cpp flags auto-tuning tool

A llama.cpp developer has released ggrun, an auto-tuning tool that measures GPU, RAM, and PCIe topology to compute multi-GPU and MoE expert placement for GGUF models and serves them through an OpenAI-compatible API. The project's benchmarks show ggrun running 49–74% faster than Ollama on a three-GPU rig, and loading one model that Ollama cannot load at all.

ggrun = "gguf run". (Formerly llm-server.)

Stop hand-writing `--tensor-split`, `-ot`, and KV-cache flags. Point ggrun at a GGUF and it measures your GPUs, RAM, and PCIe topology, picks the backend (llama.cpp or the faster ik_llama.cpp fork), computes multi-GPU and MoE expert placement, and serves an OpenAI-compatible API. The placement is exact (read from the GGUF and measured VRAM, not guessed), so big MoE models load across mismatched GPUs that would run out of memory under other launchers.

- `ggrun recommend`: rank the best models for YOUR hardware
- `ggrun unsloth/Qwen3.6-27B-GGUF --download`: HF repo → hardware-matched quant → served
- `ggrun model.gguf`: local GGUF → served
- `ggrun`: no args → interactive TUI

No flags: `ggrun` with no arguments opens a TUI that detects your GPUs, lists your models, computes hardware-matched launch settings, and ranks downloads by fit. Pass a model path or flags for one-shot CLI use instead.

Same rig (RTX 3090 Ti 24GB + 4070 12GB + 3060 12GB, 125GB RAM), same GGUFs, 32k context, decode tok/s (256-token generation), slowest backend on the left:

| Model (quant) | Ollama 0.30.8 | llama.cpp `--fit` | ggrun v3 | v3 `--ai-tune` | v3 vs Ollama |
|---|---|---|---|---|---|
| Qwen3.5-4B (Q4_K_M) | 124.8 | 103.3 | 151.4 | 185.7 | +49% |
| Qwen3.6-27B (Q5_K_M) | 22.8 | 24.3 | 37.4 | 39.8 | +74% |
| Qwen3.5-122B-A10B (UD-IQ4_XS, MoE) | 13.5† | 21.0 | 22.9 | 23.1 | +71% |
| MiniMax-M3 (UD-IQ3_XXS, MoE) | ✗ won't load | ✗ won't load | 5.59 | 5.65 | Ollama can't load |

† Ollama can't import sharded GGUFs ([ollama#5245](https://github.com/ollama/ollama/issues/5245)), so the 122B was merged to one file before importing; MiniMax-M3 it can't load at all (minimax-m3 is ik_llama-only).

Where models load, ggrun is 49–74% faster than Ollama on this rig, including +71% on the 122B MoE at heavy VRAM+RAM offload (60 GB, ~18 GB spilled to RAM). Driving the same llama.cpp master binary (no ik_llama), ggrun still beat raw `--fit`, so the gain is the placement, not just the backend swap.
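To make the placement idea concrete, here is a minimal Python sketch of splitting a model across mismatched GPUs in proportion to usable VRAM. It is an illustration only, not ggrun's actual algorithm, which per the README also weighs PCIe topology and per-tensor GGUF sizes. The function names, the 2 GB headroom default, and the assumption that each card's nominal VRAM is fully free are all hypothetical; only the `--tensor-split` flag name and the rig sizes come from the text above.

```python
# Hypothetical sketch: VRAM-proportional split for mismatched GPUs.
# NOT ggrun's algorithm; names and the headroom default are assumptions.

def usable_vram_gb(free_vram_gb, headroom_gb=2.0):
    """VRAM left per GPU after reserving a fixed safety headroom."""
    return [max(v - headroom_gb, 0.0) for v in free_vram_gb]


def tensor_split(free_vram_gb, headroom_gb=2.0):
    """Per-GPU proportions suitable for a --tensor-split style flag."""
    usable = usable_vram_gb(free_vram_gb, headroom_gb)
    total = sum(usable)
    if total == 0:
        raise ValueError("no usable VRAM after headroom")
    return [u / total for u in usable]


def spill_to_ram_gb(model_gb, free_vram_gb, headroom_gb=2.0):
    """Weights (in GB) that do not fit in VRAM and must stay in system RAM."""
    total = sum(usable_vram_gb(free_vram_gb, headroom_gb))
    return max(model_gb - total, 0.0)


# Benchmark rig from the text: 24 GB + 12 GB + 12 GB (nominal sizes,
# assumed fully free for this illustration).
rig = [24.0, 12.0, 12.0]
split = tensor_split(rig)
flag = "--tensor-split " + ",".join(f"{p:.2f}" for p in split)
print(flag)
print(spill_to_ram_gb(60.0, rig))
```

Under these assumptions the largest card receives a little over half of the weights, and a 60 GB model leaves 18 GB for system RAM. That is in line with the "~18 GB spilled to RAM" figure quoted for the 122B MoE, but the agreement should not be read as confirmation of how ggrun computes its placement.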
Measured with [scripts/bench-v3-comparison.sh](/raketenkater/ggrun/blob/main/scripts/bench-v3-comparison.sh) on the rig above; full method and backend commits are in [docs/launch-performance.md](/raketenkater/ggrun/blob/main/docs/launch-performance.md).

- Linux / macOS (self-contained app home under `~/ggrun`): `curl -fsSL https://raw.githubusercontent.com/raketenkater/ggrun/main/setup.sh | bash`
- Windows (PowerShell; add `-Backend cuda` for native NVIDIA CUDA): `iwr -useb https://raw.githubusercontent.com/raketenkater/ggrun/main/install.ps1 | iex`
- From a clone: `git clone https://github.com/raketenkater/ggrun.git && cd ggrun && ./setup.sh`

Since v3, [prebuilt release bundles](https://github.com/raketenkater/ggrun/releases/latest) (Linux CPU/Vulkan, macOS arm64 Metal, Windows x86_64 CPU) install without compiling, verified against `SHA256SUMS`; Linux CUDA/ik_llama.cpp builds from source for your GPU. Run `ggrun` with no arguments to open the TUI. Installer options and the app-home layout are in [docs/install.md](/raketenkater/ggrun/blob/main/docs/install.md).

- `ggrun ~/models/model.gguf`: launch a local model
- `ggrun unsloth/Qwen3.6-27B-GGUF --download`: download a fitting quant, then launch
- `ggrun model.gguf --ai-tune`: re-benchmark flags as backends change, cache the fastest
- `ggrun model.gguf --dry-run`: print the backend command without running
- `ggrun model.gguf --benchmark`: load, measure tok/s, exit

Common flags: `--backend ik_llama|llama|vulkan`, `--gpus 0,1`, `--ctx-size`, `--kv-quality`, `--kv-placement`, `--vram-headroom 2G`, `--ram-headroom 8G`, `--vision`, `--spec auto`. Unknown flags pass straight through to `llama-server`. Full list: [docs/usage.md](/raketenkater/ggrun/blob/main/docs/usage.md).

Security: ggrun binds to `127.0.0.1` by default because the OpenAI-compatible API is unauthenticated. To serve trusted LAN clients, opt in with `--host 0.0.0.0` (or set `LLM_HOST` / the Host setting in the TUI) and restrict access with your firewall.

vs raw llama.cpp.
Upstream `--fit` auto-picks GPU layers, tensor-split, and context. If that covers you, raw llama.cpp may be enough. ggrun goes further: it selects the backend (ik_llama.cpp is meaningfully faster on CUDA), picks KV-cache type and batch sizes from measured probes, benchmarks candidate flag sets (`--ai-tune`), finds/validates vision projectors and speculative drafts, and recovers from crashes.

vs Ollama. Ollama wins on one-command simplicity and ecosystem on common hardware. ggrun targets where Ollama's conservative heuristics leave performance behind: mismatched multi-GPU rigs, MoE models split across VRAM/RAM, ik_llama.cpp speed, and full flag access. One GPU and want zero config? Use Ollama.

vs llama-swap. llama-swap hot-swaps between model commands you write yourself; ggrun computes those commands. They compose: point llama-swap at ggrun dry-run output, or use `ggrun daemon` for single-model swapping.

| Capability | raw llama.cpp | ggrun |
|---|---|---|
| Multi-GPU / heterogeneous split | `--fit` (recent) | automatic, PCIe/bandwidth-weighted |
| MoE expert placement | `--fit` / manual `-ot` | exact per-GPU ledger, backend-aware |
| Backend selection (ik_llama / llama / Vulkan) | manual | automatic, dialect-aware |
| KV-cache type / batch sizing | manual | probe-measured |
| AI Tune (measured flag search) | no | yes, cached per model+hardware |
| Hardware-matched quant download | no | yes (HF search, capability-vs-fit ranked) |
| Vision projector / speculative decoding | manual | automatic, validated |
| Crash recovery / backend fallback | no | yes |

- One Go binary; Linux, macOS, and native Windows. CUDA / Vulkan / Metal / CPU.
- Exact-ledger multi-GPU + MoE expert placement (`--tensor-split` + `-ot` from measured VRAM and GGUF sizes), with adaptive retry on out-of-memory.
- AI Tune: re-benchmarks candidate flag sets as llama.cpp and ik_llama.cpp evolve and caches the fastest valid result per model + hardware, so your launch flags keep up with the backends without hand-tracking upstream.
It only changes performance knobs (batch, threads, flash-attn, speculative decoding), never output quality. A community tune pool seeds first launches (`LLM_COMMUNITY_TUNES=off`).
- Hugging Face downloader with hardware-aware quant selection and a GUI recommendation picker that ranks models by capability against how well each quant fits your VRAM.
- Speculative decoding (MTP, EAGLE-3, validated draft GGUFs) and vision (mmproj) support.
- OpenAI-compatible server, arrow-key TUI, crash recovery with backend fallback.

ik_llama.cpp (CUDA, source build) · llama.cpp (Vulkan, Metal, CPU) · native Windows CUDA via `install.ps1 -Backend cuda`. The backend binary is pluggable via `LLAMA_SERVER`. AMD and Intel GPUs run through Vulkan (no ROCm/HIP path). macOS/Metal builds and detects Apple unified memory but is pending validation on real Apple hardware; every benchmark here is from NVIDIA CUDA.

- Linux: `curl`, `git`, `python3`; `cmake`/compiler + NVIDIA CUDA toolkit for CUDA source builds; `vulkaninfo` for Vulkan detection.
- macOS: Apple Silicon; Xcode command-line tools for source builds.
- Windows: Windows 10/11 x86_64, PowerShell 5+, Python. `-Backend cuda` downloads prebuilt llama.cpp CUDA binaries (no toolchain needed); CUDA Toolkit + VS C++ Build Tools are only required for the from-source fallback.

[Install](/raketenkater/ggrun/blob/main/docs/install.md) · [Usage](/raketenkater/ggrun/blob/main/docs/usage.md) · [Architecture](/raketenkater/ggrun/blob/main/docs/architecture.md) · [Benchmarks](/raketenkater/ggrun/blob/main/docs/launch-performance.md) · [Speculative decoding](/raketenkater/ggrun/blob/main/docs/speculative-decoding.md) · [Model recommendations](/raketenkater/ggrun/blob/main/docs/model-recommendations.md) · [Changelog](/raketenkater/ggrun/blob/main/CHANGELOG.md)

MIT