{"slug": "llama-cpp-flags-auto-tuning-tool", "title": "Llama.cpp flags auto-tuning tool", "summary": "Llama.cpp developer released ggrun, an auto-tuning tool that measures GPU, RAM, and PCIe topology to compute optimal multi-GPU and MoE expert placement for GGUF models, serving an OpenAI-compatible API. Benchmarks show ggrun outperforms Ollama by 49-74% on a three-GPU rig, including loading models Ollama cannot handle.", "body_md": "`ggrun`\n\n= \"gguf run\". Formerly **llm-server**.\n\n**Stop hand-writing --tensor-split, -ot, and KV-cache flags.** Point ggrun\nat a GGUF and it measures your GPUs, RAM, and PCIe topology, picks the backend\n(llama.cpp or the faster ik_llama.cpp fork), computes multi-GPU and MoE expert\nplacement, and serves an OpenAI-compatible API. The placement is exact — read\nfrom the GGUF and measured VRAM, not guessed — so big MoE models load across\nmismatched GPUs that would run out of memory under other launchers.\n\n```\nggrun recommend                           # rank the best models for YOUR hardware\nggrun unsloth/Qwen3.6-27B-GGUF --download # HF repo → hardware-matched quant → served\nggrun model.gguf                          # local GGUF → served\nggrun                                     # no args → interactive TUI\n```\n\n*No flags: ggrun with no arguments opens a TUI that detects your GPUs, lists your models, computes hardware-matched launch settings, and ranks downloads by fit. 
Pass a model path or flags for one-shot CLI use instead.*\n\nSame rig (RTX 3090 Ti 24GB + 4070 12GB + 3060 12GB, 125GB RAM), same GGUFs, 32k context, decode tok/s (256-token generation), slowest backend on the left:\n\n| Model (quant) | Ollama 0.30.8 | llama.cpp `--fit` | ggrun v3 | v3 `--ai-tune` | v3 vs Ollama |\n|---|---|---|---|---|---|\n| Qwen3.5-4B Q4_K_M | 124.8 | 103.3 | 151.4 | 185.7 | +49% |\n| Qwen3.6-27B Q5_K_M | 22.8 | 24.3 | 37.4 | 39.8 | +74% |\n| Qwen3.5-122B-A10B UD-IQ4_XS (MoE) | 13.5† | 21.0 | 22.9 | 23.1 | +71% |\n| MiniMax-M3 UD-IQ3_XXS (MoE) | ✗ won't load | ✗ won't load | 5.59 | 5.65 | Ollama can't load |\n\n† Ollama can't import sharded GGUFs ([ollama#5245](https://github.com/ollama/ollama/issues/5245)),\nso the 122B was merged to one file before importing; MiniMax-M3 it can't load at all\n(`minimax-m3` is ik_llama-only). Where models load, ggrun is 49–74% faster than\nOllama on this rig, including +71% on the 122B MoE at heavy VRAM+RAM offload\n(60 GB, ~18 GB spilled to RAM). 
Driving the *same* llama.cpp master binary (no ik_llama),\nggrun still beat raw `--fit`, so the gain is the placement, not just the backend swap.\nMeasured with [scripts/bench-v3-comparison.sh](/raketenkater/ggrun/blob/main/scripts/bench-v3-comparison.sh) on the\nrig above; full method and backend commits are in\n[docs/launch-performance.md](/raketenkater/ggrun/blob/main/docs/launch-performance.md).\n\nLinux / macOS (self-contained app home under `~/ggrun`):\n\n```\ncurl -fsSL https://raw.githubusercontent.com/raketenkater/ggrun/main/setup.sh | bash\n```\n\nWindows (PowerShell); add `-Backend cuda` for native NVIDIA CUDA:\n\n```\niwr -useb https://raw.githubusercontent.com/raketenkater/ggrun/main/install.ps1 | iex\n```\n\nFrom a clone:\n\n```\ngit clone https://github.com/raketenkater/ggrun.git && cd ggrun && ./setup.sh\n```\n\nSince v3, [prebuilt release bundles](https://github.com/raketenkater/ggrun/releases/latest)\n(Linux CPU/Vulkan, macOS arm64 Metal, Windows x86_64 CPU) install without compiling,\nverified against `SHA256SUMS`; Linux CUDA/ik_llama.cpp builds from source for your GPU.\nRun `ggrun` with no arguments to open the TUI. Installer options and the app-home\nlayout are in [docs/install.md](/raketenkater/ggrun/blob/main/docs/install.md).\n\n```\nggrun ~/models/model.gguf                 # launch a local model\nggrun unsloth/Qwen3.6-27B-GGUF --download # download a fitting quant, then launch\nggrun model.gguf --ai-tune                # re-benchmark flags as backends change, cache the fastest\nggrun model.gguf --dry-run                # print the backend command without running\nggrun model.gguf --benchmark              # load, measure tok/s, exit\n```\n\nCommon flags: `--backend ik_llama|llama|vulkan`, `--gpus 0,1`, `--ctx-size`,\n`--kv-quality`, `--kv-placement`, `--vram-headroom 2G`, `--ram-headroom 8G`, `--vision`, `--spec auto`. Unknown flags pass straight\nthrough to `llama-server`. 
Full list:\n[docs/usage.md](/raketenkater/ggrun/blob/main/docs/usage.md).\n\nSecurity: ggrun binds to `127.0.0.1` by default because the OpenAI-compatible API is unauthenticated. To serve trusted LAN clients, opt in with `--host 0.0.0.0` (or set `LLM_HOST` / the Host setting in the TUI) and restrict access with your firewall.\n\n**vs raw llama.cpp.** Upstream `--fit` auto-picks GPU layers, tensor-split, and context.\nIf that covers you, raw llama.cpp may be enough. ggrun goes further: it selects the\nbackend (ik_llama.cpp is meaningfully faster on CUDA), picks KV-cache type and batch sizes\nfrom measured probes, benchmarks candidate flag sets (`--ai-tune`), finds/validates vision\nprojectors and speculative drafts, and recovers from crashes.\n\n**vs Ollama.** Ollama wins on one-command simplicity and ecosystem on common hardware.\nggrun targets where Ollama's conservative heuristics leave performance behind:\nmismatched multi-GPU rigs, MoE models split across VRAM/RAM, ik_llama.cpp speed, and full\nflag access. One GPU and want zero config? Use Ollama.\n\n**vs llama-swap.** llama-swap hot-swaps between model commands you write yourself;\nggrun computes those commands. 
They compose: point llama-swap at `ggrun dry-run` output, or use `ggrun daemon`\nfor single-model swapping.\n\n| Capability | raw llama.cpp | ggrun |\n|---|---|---|\n| Multi-GPU / heterogeneous split | `--fit` (recent) | automatic, PCIe/bandwidth-weighted |\n| MoE expert placement | `--fit` / manual `-ot` | exact per-GPU ledger, backend-aware |\n| Backend selection (ik_llama / llama / Vulkan) | manual | automatic, dialect-aware |\n| KV-cache type / batch sizing | manual | probe-measured |\n| AI Tune (measured flag search) | no | yes, cached per model+hardware |\n| Hardware-matched quant download | no | yes (HF search, capability-vs-fit ranked) |\n| Vision projector / speculative decoding | manual | automatic, validated |\n| Crash recovery / backend fallback | no | yes |\n\n- One Go binary; Linux, macOS, and native Windows. CUDA / Vulkan / Metal / CPU.\n- Exact-ledger multi-GPU + MoE expert placement (`--tensor-split` + `-ot` from measured VRAM and GGUF sizes), with adaptive retry on out-of-memory.\n- **AI Tune**: re-benchmarks candidate flag sets as llama.cpp and ik_llama.cpp evolve and caches the fastest valid result per model + hardware, so your launch flags keep up with the backends without hand-tracking upstream. It only changes performance knobs (batch, threads, flash-attn, speculative decoding), never output quality. A community tune pool seeds first launches (`LLM_COMMUNITY_TUNES=off`).\n- Hugging Face downloader with hardware-aware quant selection and a GUI recommendation picker that ranks models by capability against how well each quant fits your VRAM.\n- Speculative decoding (MTP, EAGLE-3, validated draft GGUFs) and vision (`mmproj`) support.\n- OpenAI-compatible server, arrow-key TUI, crash recovery with backend fallback.\n\nik_llama.cpp (CUDA, source build) · llama.cpp (Vulkan, Metal, CPU) · native Windows CUDA\nvia `install.ps1 -Backend cuda`. 
The backend binary is pluggable via `LLAMA_SERVER`.\nAMD and Intel GPUs run through Vulkan (no ROCm/HIP path). macOS/Metal builds and\ndetects Apple unified memory but is pending validation on real Apple hardware;\nevery benchmark here is from NVIDIA CUDA.\n\n- **Linux:** `curl`, `git`, `python3`; `cmake`/compiler + NVIDIA CUDA toolkit for CUDA source builds; `vulkaninfo` for Vulkan detection.\n- **macOS:** Apple Silicon; Xcode command-line tools for source builds.\n- **Windows:** Windows 10/11 x86_64, PowerShell 5+, Python. `-Backend cuda` downloads prebuilt llama.cpp CUDA binaries (no toolchain needed); CUDA Toolkit + VS C++ Build Tools are only required for the from-source fallback.\n\n[Install](/raketenkater/ggrun/blob/main/docs/install.md) ·\n[Usage](/raketenkater/ggrun/blob/main/docs/usage.md) ·\n[Architecture](/raketenkater/ggrun/blob/main/docs/architecture.md) ·\n[Benchmarks](/raketenkater/ggrun/blob/main/docs/launch-performance.md) ·\n[Speculative decoding](/raketenkater/ggrun/blob/main/docs/speculative-decoding.md) ·\n[Model recommendations](/raketenkater/ggrun/blob/main/docs/model-recommendations.md) ·\n[Changelog](/raketenkater/ggrun/blob/main/CHANGELOG.md)\n\nMIT", "url": "https://wpnews.pro/news/llama-cpp-flags-auto-tuning-tool", "canonical_source": "https://github.com/raketenkater/ggrun", "published_at": "2026-06-26 09:49:24+00:00", "updated_at": "2026-06-26 10:05:30.966810+00:00", "lang": "en", "topics": ["ai-tools", "machine-learning", "large-language-models", "ai-infrastructure"], "entities": ["llama.cpp", "ggrun", "Ollama", "Qwen", "MiniMax", "RTX 3090 Ti", "RTX 4070", "RTX 3060"], "alternates": {"html": "https://wpnews.pro/news/llama-cpp-flags-auto-tuning-tool", "markdown": "https://wpnews.pro/news/llama-cpp-flags-auto-tuning-tool.md", "text": "https://wpnews.pro/news/llama-cpp-flags-auto-tuning-tool.txt", "jsonld": "https://wpnews.pro/news/llama-cpp-flags-auto-tuning-tool.jsonld"}}