ggrun
= "gguf run". Formerly llm-server.
Stop hand-writing --tensor-split, -ot, and KV-cache flags. Point ggrun at a GGUF and it measures your GPUs, RAM, and PCIe topology, picks the backend (llama.cpp or the faster ik_llama.cpp fork), computes multi-GPU and MoE expert placement, and serves an OpenAI-compatible API. The placement is exact — read from the GGUF and measured VRAM, not guessed — so big MoE models load across mismatched GPUs that would run out of memory under other launchers.
ggrun recommend # rank the best models for YOUR hardware
ggrun unsloth/Qwen3.6-27B-GGUF --download # HF repo → hardware-matched quant → served
ggrun model.gguf # local GGUF → served
ggrun # no args → interactive TUI
No flags: ggrun with no arguments opens a TUI that detects your GPUs, lists your models, computes hardware-matched launch settings, and ranks downloads by fit. Pass a model path or flags for one-shot CLI use instead.
Same rig (RTX 3090 Ti 24GB + 4070 12GB + 3060 12GB, 125GB RAM), same GGUFs, 32k context, decode tok/s (256-token generation), slowest backend on the left:
| Model (quant) | Ollama 0.30.8 | llama.cpp --fit |
ggrun v3 | v3 --ai-tune |
v3 vs Ollama |
|---|---|---|---|---|---|
| Qwen3.5-4B Q4_K_M | 124.8 | 103.3 | 151.4 | 185.7 |
+49% |
| Qwen3.6-27B Q5_K_M | 22.8 | 24.3 | 37.4 | 39.8 |
+74% |
| Qwen3.5-122B-A10B UD-IQ4_XS (MoE) | 13.5† | 21.0 | 22.9 | 23.1 |
+71% |
| MiniMax-M3 UD-IQ3_XXS (MoE) | ✗ won't load | ✗ won't load | 5.59 | 5.65 |
Ollama can't load |
† Ollama can't import sharded GGUFs (ollama#5245),
so the 122B was merged to one file before importing; MiniMax-M3 it can't load at all
(minimax-m3
is ik_llama-only). Where models load, ggrun is 49–74% faster than
Ollama on this rig — including +71% on the 122B MoE at heavy VRAM+RAM offload
(60 GB, ~18 GB spilled to RAM). Driving the same llama.cpp master binary (no ik_llama),
ggrun still beat raw --fit
, so the gain is the placement, not just the backend swap. Measured with scripts/bench-v3-comparison.sh on the rig above; full method and backend commits are in
Linux / macOS — self-contained app home under ~/ggrun
:
curl -fsSL https://raw.githubusercontent.com/raketenkater/ggrun/main/setup.sh | bash
Windows (PowerShell); add -Backend cuda
for native NVIDIA CUDA:
iwr -useb https://raw.githubusercontent.com/raketenkater/ggrun/main/install.ps1 | iex
From a clone:
git clone https://github.com/raketenkater/ggrun.git && cd ggrun && ./setup.sh
Since v3, prebuilt release bundles
(Linux CPU/Vulkan, macOS arm64 Metal, Windows x86_64 CPU) install without compiling,
verified against SHA256SUMS
; Linux CUDA/ik_llama.cpp builds from source for your GPU.
Run ggrun
with no arguments to open the TUI. Installer options and the app-home layout are in docs/install.md.
ggrun ~/models/model.gguf # launch a local model
ggrun unsloth/Qwen3.6-27B-GGUF --download # download a fitting quant, then launch
ggrun model.gguf --ai-tune # re-benchmark flags as backends change, cache the fastest
ggrun model.gguf --dry-run # print the backend command without running
ggrun model.gguf --benchmark # load, measure tok/s, exit
Common flags: --backend ik_llama|llama|vulkan
, --gpus 0,1
, --ctx-size
,
--kv-quality
, --kv-placement
, --vram-headroom 2G
, --ram-headroom 8G
, --vision
, --spec auto
. Unknown flags pass straight
through to llama-server
. Full list: docs/usage.md.
Security:ggrun binds to127.0.0.1
by default because the OpenAI-compatible API isunauthenticated. To serve trusted LAN clients, opt in with--host 0.0.0.0
(or setLLM_HOST
/ the Host setting in the TUI) and restrict access with your firewall.
vs raw llama.cpp. Upstream --fit
auto-picks GPU layers, tensor-split, and context.
If that covers you, raw llama.cpp may be enough. ggrun goes further: it selects the
backend (ik_llama.cpp is meaningfully faster on CUDA), picks KV-cache type and batch sizes
from measured probes, benchmarks candidate flag sets (--ai-tune
), finds/validates vision projectors and speculative drafts, and recovers from crashes.
vs Ollama. Ollama wins on one-command simplicity and ecosystem on common hardware. ggrun targets where Ollama's conservative heuristics leave performance behind: mismatched multi-GPU rigs, MoE models split across VRAM/RAM, ik_llama.cpp speed, and full flag access. One GPU and want zero config? Use Ollama.
vs llama-swap. llama-swap hot-swaps between model commands you write yourself;
ggrun computes those commands. They compose — point llama-swap at ggrun dry-run
output, or use ggrun daemon
for single-model swapping.
| Capability | raw llama.cpp | ggrun |
|---|---|---|
| Multi-GPU / heterogeneous split | --fit (recent) |
|
| automatic, PCIe/bandwidth-weighted | ||
| MoE expert placement | --fit / manual -ot |
|
| exact per-GPU ledger, backend-aware | ||
| Backend selection (ik_llama / llama / Vulkan) | manual | automatic, dialect-aware |
| KV-cache type / batch sizing | manual | probe-measured |
| AI Tune (measured flag search) | no | yes, cached per model+hardware |
| Hardware-matched quant download | no | yes (HF search, capability-vs-fit ranked) |
| Vision projector / speculative decoding | manual | automatic, validated |
| Crash recovery / backend fallback | no | yes |
- One Go binary; Linux, macOS, and native Windows. CUDA / Vulkan / Metal / CPU.
- Exact-ledger multi-GPU + MoE expert placement (
--tensor-split
+-ot
from measured VRAM and GGUF sizes), with adaptive retry on out-of-memory. AI Tune— re-benchmarks candidate flag sets as llama.cpp and ik_llama.cpp evolve and caches the fastest valid result per model + hardware, so your launch flags keep up with the backends without hand-tracking upstream. It only changes performance knobs (batch, threads, flash-attn, speculative decoding), never output quality. A community tune pool seeds first launches (LLM_COMMUNITY_TUNES=off
).- Hugging Face down with hardware-aware quant selection and a GUI recommendation picker that ranks models by capability against how well each quant fits your VRAM.
- Speculative decoding (MTP, EAGLE-3, validated draft GGUFs) and vision (
mmproj
) support. - OpenAI-compatible server, arrow-key TUI, crash recovery with backend fallback.
ik_llama.cpp (CUDA, source build) · llama.cpp (Vulkan, Metal, CPU) · native Windows CUDA
via install.ps1 -Backend cuda
. The backend binary is pluggable via LLAMA_SERVER
. AMD and Intel GPUs run through Vulkan (no ROCm/HIP path). macOS/Metal builds and detects Apple unified memory but is pending validation on real Apple hardware — every benchmark here is from NVIDIA CUDA.
Linux:curl
,git
,python3
;cmake
/compiler + NVIDIA CUDA toolkit for CUDA source builds;vulkaninfo
for Vulkan detection.macOS: Apple Silicon; Xcode command-line tools for source builds.Windows: Windows 10/11 x86_64, PowerShell 5+, Python.-Backend cuda
downloads prebuilt llama.cpp CUDA binaries (no toolchain needed); CUDA Toolkit + VS C++ Build Tools are only required for the from-source fallback.
Install · Usage · Architecture · Benchmarks · Speculative decoding · Model recommendations · Changelog
MIT