# llama.cpp b9455 Finally Caught vLLM: 70t/s on 2x3090 Qwen 27B UQ8

> Source: <https://dev.to/yiqinumber1/llamacpp-b9455-finally-caught-vllm-70ts-on-2x3090-qwen-27b-uq8-1m74>
> Published: 2026-06-03 06:03:19+00:00

A Reddit user on r/LocalLLaMA just dropped some impressive numbers for llama.cpp build b9455, and they're worth paying attention to if you're running multi-GPU setups.

For months, vLLM was the undisputed king of multi-GPU inference — its tensor parallelism consistently hit 70+ tokens per second on dual 3090s while llama.cpp languished at 30-50 t/s. GGUF users grudgingly accepted the speed penalty as the cost of good quantization.

Build b9455 changed that.

Hardware: 2x RTX 3090s (24GB each). Model: Unsloth's `Qwen3.6-27B-UD-Q8_K_XL`, a UD-Q8 quant that this user had been running at 30-50 t/s on older llama.cpp builds.

The key new flags:

```
--tensor-split 50,50 -sm tensor
--flash-attn on
--cache-type-k q8_0 --cache-type-v q8_0
--spec-type draft-mtp --spec-draft-n-max 3
```
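For anyone scripting their own retest, here is a minimal sketch of how those flags could be assembled into a full `llama-server` invocation. The flag values are copied from the post. The model path and the optional `-c` context size are assumptions, since the post does not give either.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size=None):
    """Assemble a llama-server argv list using the flags quoted in the post."""
    cmd = [
        "llama-server",
        "-m", model_path,               # hypothetical path, supply your own
        "--tensor-split", "50,50",      # even share of the model per GPU
        "-sm", "tensor",                # split mode named in the post
        "--flash-attn", "on",
        "--cache-type-k", "q8_0",       # quantized KV cache (keys)
        "--cache-type-v", "q8_0",       # quantized KV cache (values)
        "--spec-type", "draft-mtp",     # speculative decoding settings from the post
        "--spec-draft-n-max", "3",
    ]
    if ctx_size is not None:
        cmd += ["-c", str(ctx_size)]    # context size is not stated in the post
    return cmd


# The file name below is illustrative only.
cmd = build_llama_server_cmd("models/Qwen3.6-27B-UD-Q8_K_XL.gguf")
print(shlex.join(cmd))
# To actually launch: subprocess.run(cmd, check=True)
```

Keeping the flags in a list makes it easy to drop one at a time and measure which of them actually moves throughput on your hardware.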

The `-sm tensor` flag is the magic here. It changes how llama.cpp splits the model across GPUs. The default layer-based split gives each card a contiguous block of layers, so one card sits mostly idle while the other works; tensor parallelism instead distributes individual matrix multiplications across both GPUs simultaneously.
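To make the idea concrete, here is a toy sketch of tensor parallelism in plain Python. It splits a weight matrix by columns into two shards (standing in for two GPUs), computes each shard's slice of the output separately, and concatenates the slices. This is a conceptual illustration only, not how llama.cpp implements it; all function names are made up for the example.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a k-by-n matrix w (list of rows)."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]


def split_columns(w, parts=2):
    """Slice w into `parts` column blocks, one per simulated GPU."""
    n = len(w[0])
    step = n // parts
    bounds = [(p * step, n if p == parts - 1 else (p + 1) * step)
              for p in range(parts)]
    return [[row[lo:hi] for row in w] for lo, hi in bounds]


def tensor_parallel_matmul(x, w, parts=2):
    """Each shard computes its slice of the output; slices are concatenated."""
    out = []
    for shard in split_columns(w, parts):  # on real hardware these run concurrently
        out.extend(matmul(x, shard))
    return out


x = [1.0, 2.0, 3.0]
w = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
]

# The sharded result matches the single-device result exactly.
assert tensor_parallel_matmul(x, w) == matmul(x, w)
```

Every shard needs the full input vector but only its own columns of the weights, which is why both cards can stay busy on the same layer at the same time.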

Here's a raw coding session trace. Each line shows context size and throughput:

```
ctx 27K · pp 27K/18.8s 1417t/s · out 248/3.0s 81t/s · cold
ctx 31K · pp 3.8K/3.2s 1171t/s · out 353/4.7s 74t/s · 27K cached
ctx 37K · pp 6.7K/5.7s 1184t/s · out 335/4.5s 74t/s · 31K cached
ctx 43K · pp 5.5K/4.9s 1121t/s · out 357/5.0s 71t/s · 37K cached
ctx 44K · pp 1.3K/1.5s 861t/s · out 377/5.2s 72t/s · 43K cached
ctx 63K · pp 6.0K/4.8s 1266t/s · out 856/12.3s 69t/s · 57K cached
ctx 68K · pp 68K/54.2s 1247t/s · out 2.0K/28.8s 68t/s · cold
```
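If you want to compare your own runs against these, a small parser helps. The sketch below assumes the exact line layout shown in the trace above (the field names in the regex are my own labels, not anything the tool emits) and pulls out the decode-speed range.

```python
import re

TRACE = """\
ctx 27K · pp 27K/18.8s 1417t/s · out 248/3.0s 81t/s · cold
ctx 31K · pp 3.8K/3.2s 1171t/s · out 353/4.7s 74t/s · 27K cached
ctx 37K · pp 6.7K/5.7s 1184t/s · out 335/4.5s 74t/s · 31K cached
ctx 43K · pp 5.5K/4.9s 1121t/s · out 357/5.0s 71t/s · 37K cached
ctx 44K · pp 1.3K/1.5s 861t/s · out 377/5.2s 72t/s · 43K cached
ctx 63K · pp 6.0K/4.8s 1266t/s · out 856/12.3s 69t/s · 57K cached
ctx 68K · pp 68K/54.2s 1247t/s · out 2.0K/28.8s 68t/s · cold
"""

LINE_RE = re.compile(
    r"ctx (?P<ctx>[\d.]+K?) · "
    r"pp (?P<pp_tokens>[\d.]+K?)/(?P<pp_secs>[\d.]+)s (?P<pp_tps>\d+)t/s · "
    r"out (?P<out_tokens>[\d.]+K?)/(?P<out_secs>[\d.]+)s (?P<out_tps>\d+)t/s · "
    r"(?P<cache>.+)"
)


def to_tokens(text):
    """Convert '27K' to 27000, '3.8K' to 3800, and '248' to 248."""
    if text.endswith("K"):
        return round(float(text[:-1]) * 1000)
    return int(text)


def parse_line(line):
    """Parse one trace line into a dict of numeric fields."""
    m = LINE_RE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unrecognized trace line: {line!r}")
    return {
        "ctx": to_tokens(m["ctx"]),
        "pp_tokens": to_tokens(m["pp_tokens"]),
        "pp_secs": float(m["pp_secs"]),
        "pp_tps": int(m["pp_tps"]),
        "out_tokens": to_tokens(m["out_tokens"]),
        "out_secs": float(m["out_secs"]),
        "out_tps": int(m["out_tps"]),
        "cache": m["cache"],
    }


rows = [parse_line(line) for line in TRACE.splitlines() if line.strip()]
decode = [row["out_tps"] for row in rows]
print(f"decode t/s: min {min(decode)}, max {max(decode)}")
```

Note that the token counts in the trace are rounded (`27K`, `3.8K`), so throughput recomputed from tokens and seconds will not exactly match the reported `t/s` figures; the sketch reads the reported values rather than recomputing them.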

Three things jump out:

**1. Decode speed holds steady at 68-81 t/s across this trace.** Even at 68K context with a 2K token output, it held 68 t/s. That's the kind of consistency you need for agent workloads where context grows relentlessly across turns.

**2. Prompt processing is absurdly fast.** The cold-started 27K context filled in 18.8 seconds, which is 1,417 tokens per second of prefill. At that rate, filling a full 100K context from cold takes about 70 seconds, and closer to 80 seconds at the 1,247 t/s measured on the 68K cold run.
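The extrapolation is plain division, sketched here using the two cold-run prefill rates from the trace. It assumes the rate stays flat all the way to 100K, which is optimistic given that the 68K cold run was already slower than the 27K one.

```python
def prefill_seconds(n_tokens, tokens_per_second):
    """Time to process n_tokens of prompt at a constant prefill rate."""
    return n_tokens / tokens_per_second


TARGET = 100_000  # tokens in a full 100K context

# Prefill rates taken from the two cold lines in the trace.
for label, rate in [("27K cold run", 1417), ("68K cold run", 1247)]:
    print(f"{label}: {prefill_seconds(TARGET, rate):.1f} s at {rate} t/s")
```

The two rates give roughly 71 and 80 seconds respectively, so treat anything under about 70 seconds as a best case.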

**3. The low point was 54 t/s, on a 4,500-token decode (a run not shown in the excerpt above).** Long outputs are usually the bottleneck. Even there it stayed above 50 t/s, which at Q8 quality is more than usable for streaming a full code review or refactor.

The OP had been running `qwen3.6-mtp-8.0` on vLLM as a compromise: it ran fast, but the 8.0 quant was making subtle coding mistakes. Wrong variable names. Off-by-one errors in generated loops. The kind of bugs that pass unit tests but fail code review.

UD-Q8_K_XL at this speed is a completely different experience. Code output is clean — not "mostly correct," actually correct. For anyone feeding models into an agent loop that runs 20+ turns without human intervention, those silent errors compound fast. Going back to Q8 at vLLM speed eliminates an entire class of failures.
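A back-of-the-envelope sketch shows why small per-turn error rates matter over long agent runs. The error rates below are hypothetical, not measurements from the post, and the model assumes each turn fails independently, which real agent loops may not.

```python
def clean_run_probability(per_turn_error_rate, turns):
    """Chance that every turn in a run is error-free, assuming independent turns."""
    return (1.0 - per_turn_error_rate) ** turns


TURNS = 20  # the "20+ turns" agent loop mentioned above

# Hypothetical per-turn rates of a silent mistake slipping through.
for rate in (0.01, 0.02, 0.05):
    p = clean_run_probability(rate, TURNS)
    print(f"{rate:.0%} per-turn error rate -> {p:.0%} chance of a clean {TURNS}-turn run")
```

Under these assumptions, even a 2% per-turn slip rate leaves only about a two-in-three chance of finishing 20 turns cleanly.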

A few configuration details worth noting:

- `--spec-draft-n-max 3`: the draft model predicts 2-3 tokens ahead accurately enough to justify the extra compute. Going higher than 3 showed diminishing returns.
- `--no-mmap`
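The diminishing returns from longer drafts fall out of a simple model of speculative decoding. The sketch below assumes each drafted token is accepted independently with a fixed probability; the 0.7 acceptance rate is a hypothetical value, not something the post measured, and the model ignores the cost of producing the drafts.

```python
def expected_tokens_per_step(accept_rate, n_draft):
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate and that acceptance stops at the first rejection. The
    verifier always contributes one token, hence the k = 0 term.
    """
    return sum(accept_rate ** k for k in range(n_draft + 1))


ACCEPT_RATE = 0.7  # hypothetical, not measured in the post

previous = expected_tokens_per_step(ACCEPT_RATE, 0)
for n in range(1, 7):
    current = expected_tokens_per_step(ACCEPT_RATE, n)
    print(f"n_max={n}: {current:.2f} tokens/step (+{current - previous:.2f})")
    previous = current
```

Each extra draft slot adds less than the one before it, because it only pays off when every earlier draft token was accepted. Past a few tokens the added drafting work can outweigh the gain, which is consistent with the OP settling on 3.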

If you shelved llama.cpp for multi-GPU inference because vLLM was faster, especially if you're forced into lower-quality quants on vLLM to get acceptable speed, `b9455` with `-sm tensor` is worth a retest. The gap is gone.

Full credit to the original Reddit post on r/LocalLLaMA for the benchmarks. What are you seeing with tensor-split on your hardware?
