# Llama.cpp flags auto-tuning tool

> Source: <https://github.com/raketenkater/ggrun>
> Published: 2026-06-26 09:49:24+00:00

`ggrun`

= "gguf run". Formerly **llm-server**.

**Stop hand-writing --tensor-split, -ot, and KV-cache flags.** Point ggrun
at a GGUF and it measures your GPUs, RAM, and PCIe topology, picks the backend
(llama.cpp or the faster ik_llama.cpp fork), computes multi-GPU and MoE expert
placement, and serves an OpenAI-compatible API. The placement is exact — read
from the GGUF and measured VRAM, not guessed — so big MoE models load across
mismatched GPUs that would run out of memory under other launchers.

```
ggrun recommend                           # rank the best models for YOUR hardware
ggrun unsloth/Qwen3.6-27B-GGUF --download # HF repo → hardware-matched quant → served
ggrun model.gguf                          # local GGUF → served
ggrun                                     # no args → interactive TUI
```

*No flags: ggrun with no arguments opens a TUI that detects your GPUs, lists your models, computes hardware-matched launch settings, and ranks downloads by fit. Pass a model path or flags for one-shot CLI use instead.*

Same rig (RTX 3090 Ti 24GB + 4070 12GB + 3060 12GB, 125GB RAM), same GGUFs, 32k context, decode tok/s (256-token generation), slowest backend on the left:

| Model (quant) | Ollama 0.30.8 | llama.cpp `--fit` |
ggrun v3 | v3 `--ai-tune` |
v3 vs Ollama |
|---|---|---|---|---|---|
| Qwen3.5-4B Q4_K_M | 124.8 | 103.3 | 151.4 | 185.7 |
+49% |
| Qwen3.6-27B Q5_K_M | 22.8 | 24.3 | 37.4 | 39.8 |
+74% |
| Qwen3.5-122B-A10B UD-IQ4_XS (MoE) | 13.5† | 21.0 | 22.9 | 23.1 |
+71% |
| MiniMax-M3 UD-IQ3_XXS (MoE) | ✗ won't load | ✗ won't load | 5.59 | 5.65 |
Ollama can't load |

† Ollama can't import sharded GGUFs ([ollama#5245](https://github.com/ollama/ollama/issues/5245)),
so the 122B was merged to one file before importing; MiniMax-M3 it can't load at all
(`minimax-m3`

is ik_llama-only). Where models load, ggrun is 49–74% faster than
Ollama on this rig — including +71% on the 122B MoE at heavy VRAM+RAM offload
(60 GB, ~18 GB spilled to RAM). Driving the *same* llama.cpp master binary (no ik_llama),
ggrun still beat raw `--fit`

, so the gain is the placement, not just the backend swap.
Measured with [ scripts/bench-v3-comparison.sh](/raketenkater/ggrun/blob/main/scripts/bench-v3-comparison.sh) on the
rig above; full method and backend commits are in

[docs/launch-performance.md](/raketenkater/ggrun/blob/main/docs/launch-performance.md).

Linux / macOS — self-contained app home under `~/ggrun`

:

```
curl -fsSL https://raw.githubusercontent.com/raketenkater/ggrun/main/setup.sh | bash
```

Windows (PowerShell); add `-Backend cuda`

for native NVIDIA CUDA:

```
iwr -useb https://raw.githubusercontent.com/raketenkater/ggrun/main/install.ps1 | iex
```

From a clone:

```
git clone https://github.com/raketenkater/ggrun.git && cd ggrun && ./setup.sh
```

Since v3, [prebuilt release bundles](https://github.com/raketenkater/ggrun/releases/latest)
(Linux CPU/Vulkan, macOS arm64 Metal, Windows x86_64 CPU) install without compiling,
verified against `SHA256SUMS`

; Linux CUDA/ik_llama.cpp builds from source for your GPU.
Run `ggrun`

with no arguments to open the TUI. Installer options and the app-home
layout are in [docs/install.md](/raketenkater/ggrun/blob/main/docs/install.md).

```
ggrun ~/models/model.gguf                 # launch a local model
ggrun unsloth/Qwen3.6-27B-GGUF --download # download a fitting quant, then launch
ggrun model.gguf --ai-tune                # re-benchmark flags as backends change, cache the fastest
ggrun model.gguf --dry-run                # print the backend command without running
ggrun model.gguf --benchmark              # load, measure tok/s, exit
```

Common flags: `--backend ik_llama|llama|vulkan`

, `--gpus 0,1`

, `--ctx-size`

,
`--kv-quality`

, `--kv-placement`

, `--vram-headroom 2G`

, `--ram-headroom 8G`

, `--vision`

, `--spec auto`

. Unknown flags pass straight
through to `llama-server`

. Full list:
[docs/usage.md](/raketenkater/ggrun/blob/main/docs/usage.md).

Security:ggrun binds to`127.0.0.1`

by default because the OpenAI-compatible API isunauthenticated. To serve trusted LAN clients, opt in with`--host 0.0.0.0`

(or set`LLM_HOST`

/ the Host setting in the TUI) and restrict access with your firewall.

**vs raw llama.cpp.** Upstream `--fit`

auto-picks GPU layers, tensor-split, and context.
If that covers you, raw llama.cpp may be enough. ggrun goes further: it selects the
backend (ik_llama.cpp is meaningfully faster on CUDA), picks KV-cache type and batch sizes
from measured probes, benchmarks candidate flag sets (`--ai-tune`

), finds/validates vision
projectors and speculative drafts, and recovers from crashes.

**vs Ollama.** Ollama wins on one-command simplicity and ecosystem on common hardware.
ggrun targets where Ollama's conservative heuristics leave performance behind:
mismatched multi-GPU rigs, MoE models split across VRAM/RAM, ik_llama.cpp speed, and full
flag access. One GPU and want zero config? Use Ollama.

**vs llama-swap.** llama-swap hot-swaps between model commands you write yourself;
ggrun computes those commands. They compose — point llama-swap at `ggrun dry-run`

output, or use `ggrun daemon`

for single-model swapping.

| Capability | raw llama.cpp | ggrun |
|---|---|---|
| Multi-GPU / heterogeneous split | `--fit` (recent) |
automatic, PCIe/bandwidth-weighted |
| MoE expert placement | `--fit` / manual `-ot` |
exact per-GPU ledger, backend-aware |
| Backend selection (ik_llama / llama / Vulkan) | manual | automatic, dialect-aware |
| KV-cache type / batch sizing | manual | probe-measured |
| AI Tune (measured flag search) | no | yes, cached per model+hardware |
| Hardware-matched quant download | no | yes (HF search, capability-vs-fit ranked) |
| Vision projector / speculative decoding | manual | automatic, validated |
| Crash recovery / backend fallback | no | yes |

- One Go binary; Linux, macOS, and native Windows. CUDA / Vulkan / Metal / CPU.
- Exact-ledger multi-GPU + MoE expert placement (
`--tensor-split`

+`-ot`

from measured VRAM and GGUF sizes), with adaptive retry on out-of-memory. **AI Tune**— re-benchmarks candidate flag sets as llama.cpp and ik_llama.cpp evolve and caches the fastest valid result per model + hardware, so your launch flags keep up with the backends without hand-tracking upstream. It only changes performance knobs (batch, threads, flash-attn, speculative decoding), never output quality. A community tune pool seeds first launches (`LLM_COMMUNITY_TUNES=off`

).- Hugging Face downloader with hardware-aware quant selection and a GUI recommendation picker that ranks models by capability against how well each quant fits your VRAM.
- Speculative decoding (MTP, EAGLE-3, validated draft GGUFs) and vision (
`mmproj`

) support. - OpenAI-compatible server, arrow-key TUI, crash recovery with backend fallback.

ik_llama.cpp (CUDA, source build) · llama.cpp (Vulkan, Metal, CPU) · native Windows CUDA
via `install.ps1 -Backend cuda`

. The backend binary is pluggable via `LLAMA_SERVER`

.
AMD and Intel GPUs run through Vulkan (no ROCm/HIP path). macOS/Metal builds and
detects Apple unified memory but is pending validation on real Apple hardware —
every benchmark here is from NVIDIA CUDA.

**Linux:**`curl`

,`git`

,`python3`

;`cmake`

/compiler + NVIDIA CUDA toolkit for CUDA source builds;`vulkaninfo`

for Vulkan detection.**macOS:** Apple Silicon; Xcode command-line tools for source builds.**Windows:** Windows 10/11 x86_64, PowerShell 5+, Python.`-Backend cuda`

downloads prebuilt llama.cpp CUDA binaries (no toolchain needed); CUDA Toolkit + VS C++ Build Tools are only required for the from-source fallback.

[Install](/raketenkater/ggrun/blob/main/docs/install.md) ·
[Usage](/raketenkater/ggrun/blob/main/docs/usage.md) ·
[Architecture](/raketenkater/ggrun/blob/main/docs/architecture.md) ·
[Benchmarks](/raketenkater/ggrun/blob/main/docs/launch-performance.md) ·
[Speculative decoding](/raketenkater/ggrun/blob/main/docs/speculative-decoding.md) ·
[Model recommendations](/raketenkater/ggrun/blob/main/docs/model-recommendations.md) ·
[Changelog](/raketenkater/ggrun/blob/main/CHANGELOG.md)

MIT