[ARTICLE · art-20015] src=dev.to pub= topic=large-language-models verified=true sentiment=↑ positive

llama.cpp b9455 Finally Caught vLLM: 70t/s on 2x3090 Qwen 27B UQ8

A Reddit user reported that llama.cpp build b9455 achieved 67-81 tokens per second on a dual RTX 3090 setup running Unsloth's Qwen3.6-27B-UD-Q8_K_XL model, matching the speed of vLLM for multi-GPU inference. The performance gain came from the new `-sm tensor` flag, which enables true tensor parallelism across GPUs, eliminating the speed gap that previously forced users to accept lower-quality quantizations for acceptable throughput.

3 min read · published Jun 3, 2026

A Reddit user on r/LocalLLaMA just dropped some impressive numbers for llama.cpp build b9455, and they're worth paying attention to if you're running multi-GPU setups.

For months, vLLM was the undisputed king of multi-GPU inference — its tensor parallelism consistently hit 70+ tokens per second on dual 3090s while llama.cpp languished at 30-50 t/s. GGUF users grudgingly accepted the speed penalty as the cost of good quantization.

Build b9455 changed that.

Hardware: 2x RTX 3090s (24GB each). Model: Unsloth's Qwen3.6-27B-UD-Q8_K_XL, a UD-Q8 quant that this user had been running at 30-50 t/s on older llama.cpp builds.

The key new flags:

--tensor-split 50,50 -sm tensor
--flash-attn on
--cache-type-k q8_0 --cache-type-v q8_0
--spec-type draft-mtp --spec-draft-n-max 3
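For readers who want to try these flags, here is a minimal sketch that assembles them into one command line. The `llama-server` binary, the `-m` option, and the model path are assumptions added for illustration; the post only quotes the flags themselves, and the OP's full command may have differed.

```python
import shlex

# Hypothetical model path; substitute the GGUF file you actually downloaded.
MODEL_PATH = "models/Qwen3.6-27B-UD-Q8_K_XL.gguf"


def build_command(model_path):
    """Assemble an argv list from the flags quoted in the post."""
    return [
        "llama-server",                 # assumed binary; the post does not name one
        "-m", model_path,               # assumed; model selection is not in the quoted flags
        "--tensor-split", "50,50",      # even split across the two cards
        "-sm", "tensor",                # split mode reported for build b9455
        "--flash-attn", "on",
        "--cache-type-k", "q8_0",       # quantized KV cache
        "--cache-type-v", "q8_0",
        "--spec-type", "draft-mtp",     # speculative decoding settings from the post
        "--spec-draft-n-max", "3",
    ]


# Print a shell-safe command line that can be pasted into a terminal.
print(shlex.join(build_command(MODEL_PATH)))
```

Keeping the flags in a list makes it easy to toggle one option at a time when comparing builds.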

The -sm tensor flag is the magic here. It changes how llama.cpp splits the model across GPUs: instead of the default layer-based split (each card holds a contiguous block of layers, so one card sits mostly idle while the other works), tensor parallelism distributes individual matrix multiplications across both GPUs simultaneously.
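The general idea behind tensor parallelism can be sketched in a few lines of plain Python. This is a toy illustration of column-wise sharding, not llama.cpp's actual implementation (which the post does not describe): each simulated GPU multiplies the same input by its own slice of the weight matrix, and the partial outputs are concatenated.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a k-by-n matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Cut w into `parts` column blocks of near-equal width."""
    n = len(w[0])
    bounds = [n * p // parts for p in range(parts + 1)]
    return [[row[bounds[p]:bounds[p + 1]] for row in w] for p in range(parts)]


def tensor_parallel_matmul(x, w, parts=2):
    """Each simulated GPU handles one column block; outputs are concatenated."""
    out = []
    for shard in split_columns(w, parts):  # on real hardware the shards run concurrently
        out.extend(matmul(x, shard))
    return out


x = [1.0, 2.0, 3.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0],
     [2.0, 1.0, 0.0, 1.0]]

# The sharded result matches the single-device result.
assert tensor_parallel_matmul(x, w) == matmul(x, w)
print(tensor_parallel_matmul(x, w))
```

Because every shard works on the same matrix multiplication at once, both cards stay busy for each layer rather than taking turns.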

Here's a raw coding session trace. Each line shows context size and throughput:

ctx 27K · pp 27K/18.8s 1417t/s · out 248/3.0s 81t/s · cold
ctx 31K · pp 3.8K/3.2s 1171t/s · out 353/4.7s 74t/s · 27K cached
ctx 37K · pp 6.7K/5.7s 1184t/s · out 335/4.5s 74t/s · 31K cached
ctx 43K · pp 5.5K/4.9s 1121t/s · out 357/5.0s 71t/s · 37K cached
ctx 44K · pp 1.3K/1.5s 861t/s · out 377/5.2s 72t/s · 43K cached
ctx 63K · pp 6.0K/4.8s 1266t/s · out 856/12.3s 69t/s · 57K cached
ctx 68K · pp 68K/54.2s 1247t/s · out 2.0K/28.8s 68t/s · cold
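To summarize a trace like this without reading it line by line, a small parser helps. The sketch below assumes the exact line layout shown above (`out <tokens>/<seconds>s <rate>t/s`, with an optional `K` suffix meaning thousands); the function name and regular expression are illustrative and would need adjusting for other log formats.

```python
import re

TRACE = """\
ctx 27K · pp 27K/18.8s 1417t/s · out 248/3.0s 81t/s · cold
ctx 31K · pp 3.8K/3.2s 1171t/s · out 353/4.7s 74t/s · 27K cached
ctx 37K · pp 6.7K/5.7s 1184t/s · out 335/4.5s 74t/s · 31K cached
ctx 43K · pp 5.5K/4.9s 1121t/s · out 357/5.0s 71t/s · 37K cached
ctx 44K · pp 1.3K/1.5s 861t/s · out 377/5.2s 72t/s · 43K cached
ctx 63K · pp 6.0K/4.8s 1266t/s · out 856/12.3s 69t/s · 57K cached
ctx 68K · pp 68K/54.2s 1247t/s · out 2.0K/28.8s 68t/s · cold
"""

# Matches the decode segment: token count, optional K suffix, seconds, reported rate.
OUT_RE = re.compile(r"out ([\d.]+)(K?)/([\d.]+)s (\d+)t/s")


def decode_stats(trace):
    """Return (tokens, seconds, reported_tps) for each line's decode segment."""
    rows = []
    for line in trace.splitlines():
        m = OUT_RE.search(line)
        if m:
            tokens = float(m.group(1)) * (1000 if m.group(2) else 1)
            rows.append((tokens, float(m.group(3)), int(m.group(4))))
    return rows


rows = decode_stats(TRACE)
reported = [tps for _, _, tps in rows]
print("decode t/s range:", min(reported), "to", max(reported))

# Token-weighted average over the whole session.
total_tokens = sum(t for t, _, _ in rows)
total_seconds = sum(s for _, s, _ in rows)
print("session average t/s:", round(total_tokens / total_seconds, 1))
```

The token-weighted average is the more honest single number, since the long final decode dominates the wall-clock time.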

Three things jump out:

1. Decode speed is rock-solid at 67-81 t/s. Even at 68K context with a 2K token output, it held 68 t/s. That's the kind of consistency you need for agent workloads where context grows relentlessly across turns.

2. Prompt processing is absurdly fast. A cold-started 27K context filled in 18.8 seconds, or 1,417 tokens per second of prefill. At the cold-start rates in the trace (1,247-1,417 t/s), filling a full 100K context from cold would take roughly 70-80 seconds.

3. The OP's low point was 54 t/s on a 4,500-token decode, a run that does not appear in the trace excerpt above. Long outputs are usually the bottleneck. Even there it stayed above 50 t/s, which at Q8 quality is more than usable for streaming a full code review or refactor.
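The cold-fill estimate in point 2 is simple division, and it is easy to rerun for other context sizes. The sketch below uses the two cold-start prefill rates from the trace; it assumes prefill throughput stays roughly constant out to 100K tokens, which the post does not measure.

```python
def prefill_seconds(context_tokens, tokens_per_second):
    """Time to process a prompt of the given size at a constant prefill rate."""
    return context_tokens / tokens_per_second


# Cold-start prefill rates taken from the first and last lines of the trace.
for rate in (1417, 1247):
    seconds = prefill_seconds(100_000, rate)
    print(f"{rate} t/s -> {seconds:.1f} s for a 100K-token prompt")
```

With these inputs the estimate lands between about 70 and 80 seconds.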

The OP had been running qwen3.6-mtp-8.0 on vLLM as a compromise: it ran fast, but the 8.0 quant was making subtle coding mistakes. Wrong variable names. Off-by-one errors in generated loops. The kind of bugs that pass unit tests but fail code review.

UD-Q8_K_XL at this speed is a completely different experience. Code output is clean — not "mostly correct," actually correct. For anyone feeding models into an agent loop that runs 20+ turns without human intervention, those silent errors compound fast. Going back to Q8 at vLLM speed eliminates an entire class of failures.

A few configuration details worth noting:

Speculative decoding (--spec-draft-n-max 3): the draft model predicts 2-3 tokens ahead accurately enough to justify the extra compute. Going higher than 3 showed diminishing returns.

--no-mmap
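One way to see why longer drafts stop paying off is the standard expected-value model for speculative decoding. It assumes each drafted token is accepted independently with a fixed probability; the 0.7 used below is a made-up illustrative value, not a figure from the post, and real acceptance rates vary with the model and the prompt.

```python
def expected_tokens_per_pass(accept_prob, n_draft):
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_prob, and that drafting stops at the first rejection.
    """
    if accept_prob >= 1.0:
        return float(n_draft + 1)
    return (1 - accept_prob ** (n_draft + 1)) / (1 - accept_prob)


ACCEPT_PROB = 0.7  # hypothetical acceptance rate, chosen only for illustration

previous = expected_tokens_per_pass(ACCEPT_PROB, 0)
for n in range(1, 7):
    current = expected_tokens_per_pass(ACCEPT_PROB, n)
    print(f"n_draft={n}: {current:.3f} tokens/pass (gain {current - previous:.3f})")
    previous = current
```

Under this model each extra drafted token adds less than the one before it, while the cost of drafting and verifying it does not shrink, which is consistent with the OP settling on 3.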

If you shelved llama.cpp for multi-GPU inference because vLLM was faster, especially if you're forced into lower-quality quants on vLLM to get acceptable speed, b9455 with -sm tensor is worth a retest. The gap is gone.

Full credit to the original Reddit post on r/LocalLLaMA for the benchmarks. What are you seeing with tensor-split on your hardware?
