A Reddit user on r/LocalLLaMA just dropped some impressive numbers for llama.cpp build b9455, and they're worth paying attention to if you're running multi-GPU setups.
For months, vLLM was the undisputed king of multi-GPU inference — its tensor parallelism consistently hit 70+ tokens per second on dual 3090s while llama.cpp languished at 30-50 t/s. GGUF users grudgingly accepted the speed penalty as the cost of good quantization.
Build b9455 changed that.
Hardware: 2x RTX 3090s (24GB each). Model: Unsloth's Qwen3.6-27B-UD-Q8_K_XL, a UD-Q8 quant that this user had been running at 30-50 t/s on older llama.cpp builds.
The key new flags:
--tensor-split 50,50 -sm tensor
--flash-attn on
--cache-type-k q8_0 --cache-type-v q8_0
--spec-type draft-mtp --spec-draft-n-max 3
The -sm tensor flag is the magic here. It changes how llama.cpp splits the model across GPUs: instead of the default layer-based split (where each card holds a contiguous block of layers, so one sits mostly idle while the other works), tensor parallelism distributes individual matrix multiplications across both GPUs simultaneously.
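To make the idea concrete, here is a toy sketch of splitting one matrix multiplication across two devices. It is plain Python with no GPU involved, and every name in it (matmul, split_columns, tensor_parallel_matmul) is a hypothetical illustration, not llama.cpp internals. Each "device" holds a slice of the weight columns, computes its part of the output, and the parts are concatenated.

```python
# Toy illustration of tensor parallelism: one matmul split column-wise
# across several "devices". Not llama.cpp code; all names are hypothetical.

def matmul(x, w):
    """Multiply a row vector x (length k) by a k x n matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Split matrix w into `parts` column shards of near-equal width."""
    n = len(w[0])
    bounds = [n * p // parts for p in range(parts + 1)]
    return [[row[bounds[p]:bounds[p + 1]] for row in w] for p in range(parts)]

def tensor_parallel_matmul(x, w, parts=2):
    """Each shard computes its slice of the output; slices are concatenated."""
    out = []
    for shard in split_columns(w, parts):  # on real hardware these run concurrently
        out.extend(matmul(x, shard))
    return out

x = [1.0, 2.0, 3.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0],
     [2.0, 1.0, 0.0, 1.0]]

# The sharded result matches the single-device result.
assert tensor_parallel_matmul(x, w) == matmul(x, w)
```

The point of the sketch is only that both shards have work to do for every single matmul, which is what keeps both cards busy during decode.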
Here's a raw coding session trace. Each line shows context size and throughput:
ctx 27K · pp 27K/18.8s 1417t/s · out 248/3.0s 81t/s · cold
ctx 31K · pp 3.8K/3.2s 1171t/s · out 353/4.7s 74t/s · 27K cached
ctx 37K · pp 6.7K/5.7s 1184t/s · out 335/4.5s 74t/s · 31K cached
ctx 43K · pp 5.5K/4.9s 1121t/s · out 357/5.0s 71t/s · 37K cached
ctx 44K · pp 1.3K/1.5s 861t/s · out 377/5.2s 72t/s · 43K cached
ctx 63K · pp 6.0K/4.8s 1266t/s · out 856/12.3s 69t/s · 57K cached
ctx 68K · pp 68K/54.2s 1247t/s · out 2.0K/28.8s 68t/s · cold
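If you want to pull numbers out of a trace like this yourself, a small parser is enough. The sketch below assumes every line follows exactly the layout shown above; the field names (pp_tokens, out_tps, and so on) are my own labels, not anything the tool emits.

```python
import re

def parse_count(text):
    """Convert '27K', '3.8K' or '248' to a token count."""
    if text.endswith("K"):
        return round(float(text[:-1]) * 1000)
    return int(float(text))

# Assumed layout: ctx N · pp N/Ss Rt/s · out N/Ss Rt/s · cache note
LINE_RE = re.compile(
    r"ctx (?P<ctx>[\d.]+K?) · "
    r"pp (?P<pp>[\d.]+K?)/(?P<pp_s>[\d.]+)s (?P<pp_tps>\d+)t/s · "
    r"out (?P<out>[\d.]+K?)/(?P<out_s>[\d.]+)s (?P<out_tps>\d+)t/s · "
    r"(?P<cache>.+)"
)

def parse_line(line):
    """Parse one trace line into a dict; raise on an unexpected layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized trace line: {line!r}")
    return {
        "ctx": parse_count(m["ctx"]),
        "pp_tokens": parse_count(m["pp"]),
        "pp_seconds": float(m["pp_s"]),
        "pp_tps": int(m["pp_tps"]),
        "out_tokens": parse_count(m["out"]),
        "out_seconds": float(m["out_s"]),
        "out_tps": int(m["out_tps"]),
        "cache": m["cache"],
    }

TRACE = """\
ctx 27K · pp 27K/18.8s 1417t/s · out 248/3.0s 81t/s · cold
ctx 31K · pp 3.8K/3.2s 1171t/s · out 353/4.7s 74t/s · 27K cached
ctx 37K · pp 6.7K/5.7s 1184t/s · out 335/4.5s 74t/s · 31K cached
ctx 43K · pp 5.5K/4.9s 1121t/s · out 357/5.0s 71t/s · 37K cached
ctx 44K · pp 1.3K/1.5s 861t/s · out 377/5.2s 72t/s · 43K cached
ctx 63K · pp 6.0K/4.8s 1266t/s · out 856/12.3s 69t/s · 57K cached
ctx 68K · pp 68K/54.2s 1247t/s · out 2.0K/28.8s 68t/s · cold
"""

rows = [parse_line(line) for line in TRACE.splitlines()]
decode = [r["out_tps"] for r in rows]
print(min(decode), max(decode))  # prints "68 81"
```

The token counts and durations in the trace are rounded, so dividing them back out will not reproduce the reported t/s figures exactly.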
Three things jump out:
1. Decode speed is rock-solid at roughly 68-81 t/s across this trace. Even at 68K context with a 2K token output, it held 68 t/s. That's the kind of consistency you need for agent workloads where context grows relentlessly across turns.
2. Prompt processing is absurdly fast. A cold-started 27K context filled in 18.8 seconds, or 1,417 tokens per second of prefill. At the cold-start rates in the trace (1,247-1,417 t/s), you're looking at about 70-80 seconds to fill a full 100K context from cold.
3. The 54 t/s low point, from a run not included in the trace excerpt above, was a 4,500-token decode. Long outputs are usually the bottleneck. Even there it stayed above 50 t/s, which at Q8 quality is more than usable for streaming a full code review or refactor.
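The 100K cold-fill estimate is easy to sanity-check with a division. This sketch uses the two cold-start prefill rates from the trace and assumes the rate stays in that range all the way out to 100K, which the trace does not actually measure.

```python
def prefill_seconds(context_tokens, tokens_per_second):
    """Time to prefill a context at a constant rate (a simplifying assumption)."""
    return context_tokens / tokens_per_second

# Cold-start prefill rates reported in the trace (27K and 68K contexts).
for rate in (1417, 1247):
    print(f"{rate} t/s -> {prefill_seconds(100_000, rate):.1f} s")
# prints:
# 1417 t/s -> 70.6 s
# 1247 t/s -> 80.2 s
```

So a full 100K fill from cold lands somewhere around 70 to 80 seconds if prefill throughput holds.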
The OP had been running qwen3.6-mtp-8.0 on vLLM as a compromise: it ran fast, but the 8.0 quant was making subtle coding mistakes. Wrong variable names. Off-by-one errors in generated loops. The kind of bugs that pass unit tests but fail code review.
UD-Q8_K_XL at this speed is a completely different experience. Code output is clean — not "mostly correct," actually correct. For anyone feeding models into an agent loop that runs 20+ turns without human intervention, those silent errors compound fast. Going back to Q8 at vLLM speed eliminates an entire class of failures.
A few configuration details worth noting:
--spec-draft-n-max 3: the draft model predicts 2-3 tokens ahead accurately enough to justify the extra compute. Going higher than 3 showed diminishing returns.
--no-mmap
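One way to see why longer drafts pay off less: under the common simplifying assumption that each drafted token is accepted independently with probability alpha, the expected number of tokens produced per verification pass with n drafted tokens is (1 - alpha^(n+1)) / (1 - alpha). The acceptance rates below are made-up illustrative values, not measurements from the post, and the function name is hypothetical.

```python
def expected_tokens_per_pass(alpha, n_draft):
    """Expected tokens emitted per target-model pass, assuming each drafted
    token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return n_draft + 1.0
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)

# Illustrative acceptance rates only; real values depend on model and workload.
for alpha in (0.6, 0.8):
    totals = [expected_tokens_per_pass(alpha, n) for n in range(1, 7)]
    gains = [b - a for a, b in zip(totals, totals[1:])]
    print(alpha, [round(t, 2) for t in totals], [round(g, 2) for g in gains])
```

Under this model, going from n to n+1 drafted tokens adds only alpha^(n+1) expected tokens, so each extra draft slot is worth less than the one before while still costing draft compute. That is consistent with returns flattening out past 3, though the model ignores the cost side entirely.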
If you shelved llama.cpp for multi-GPU inference because vLLM was faster, especially if you're forced into lower-quality quants on vLLM to get acceptable speed, b9455 with -sm tensor is worth a retest. The gap is gone.
Full credit to the original Reddit post on r/LocalLLaMA for the benchmarks. What are you seeing with tensor-split on your hardware?