{"slug": "llama-cpp-b9455-finally-caught-vllm-70t-s-on-2x3090-qwen-27b-uq8", "title": "llama.cpp b9455 Finally Caught vLLM: 70t/s on 2x3090 Qwen 27B UQ8", "summary": "A Reddit user reported that llama.cpp build b9455 achieved 67-81 tokens per second on a dual RTX 3090 setup running Unsloth's Qwen3.6-27B-UD-Q8_K_XL model, matching the speed of vLLM for multi-GPU inference. The performance gain came from the new `-sm tensor` flag, which enables true tensor parallelism across GPUs, eliminating the speed gap that previously forced users to accept lower-quality quantizations for acceptable throughput.", "body_md": "A Reddit user on r/LocalLLaMA just dropped some impressive numbers for llama.cpp build b9455, and they're worth paying attention to if you're running multi-GPU setups.\n\nFor months, vLLM was the undisputed king of multi-GPU inference: its tensor parallelism consistently hit 70+ tokens per second on dual 3090s while llama.cpp languished at 30-50 t/s. GGUF users grudgingly accepted the speed penalty as the cost of good quantization.\n\nBuild b9455 changed that.\n\nHardware: 2x RTX 3090s (24GB each). Model: Unsloth's `Qwen3.6-27B-UD-Q8_K_XL`, a UD-Q8 quant that this user had been running at 30-50 t/s on older llama.cpp builds.\n\nThe key new flags:\n\n```\n--tensor-split 50,50 -sm tensor\n--flash-attn on\n--cache-type-k q8_0 --cache-type-v q8_0\n--spec-type draft-mtp --spec-draft-n-max 3\n```\n\nThe `-sm tensor` flag is the magic here. It changes how llama.cpp splits the model across GPUs: instead of the default layer-based split (where each card sits idle while the other works through its layers), tensor parallelism distributes individual matrix multiplications across both GPUs simultaneously.\n\nHere's a raw coding session trace. 
Each line shows context size, prompt-processing (pp) throughput, output throughput, and how much of the context was already cached:\n\n```\nctx 27K · pp 27K/18.8s 1417t/s · out 248/3.0s 81t/s · cold\nctx 31K · pp 3.8K/3.2s 1171t/s · out 353/4.7s 74t/s · 27K cached\nctx 37K · pp 6.7K/5.7s 1184t/s · out 335/4.5s 74t/s · 31K cached\nctx 43K · pp 5.5K/4.9s 1121t/s · out 357/5.0s 71t/s · 37K cached\nctx 44K · pp 1.3K/1.5s 861t/s · out 377/5.2s 72t/s · 43K cached\nctx 63K · pp 6.0K/4.8s 1266t/s · out 856/12.3s 69t/s · 57K cached\nctx 68K · pp 68K/54.2s 1247t/s · out 2.0K/28.8s 68t/s · cold\n```\n\nThree things jump out:\n\n**1. Decode speed is rock-solid at 67-81 t/s.** Even at 68K context with a 2K token output, it held 68 t/s. That's the kind of consistency you need for agent workloads where context grows relentlessly across turns.\n\n**2. Prompt processing is absurdly fast.** Cold-started 27K context filled in 18.8 seconds, or 1,417 tokens per second of prefill. At that rate, filling a full 100K context from cold would take roughly 70 seconds, or closer to 80 at the 1,247 t/s measured on the 68K cold run.\n\n**3. The 54 t/s low point was a 4,500-token decode.** That run is not among the lines excerpted above. Long outputs are usually the bottleneck. Even there it stayed above 50 t/s, which at Q8 quality is more than usable for streaming a full code review or refactor.\n\nThe OP had been running `qwen3.6-mtp-8.0` on vLLM as a compromise. It ran fast, but the 8.0 quant was making subtle coding mistakes. Wrong variable names. Off-by-one errors in generated loops. The kind of bugs that pass unit tests but fail code review.\n\nUD-Q8_K_XL at this speed is a completely different experience. Code output is clean: not \"mostly correct,\" actually correct. For anyone feeding models into an agent loop that runs 20+ turns without human intervention, those silent errors compound fast. Going back to Q8 at vLLM speed eliminates an entire class of failures.\n\nA few configuration details worth noting:\n\n- `--spec-draft-n-max 3`: the draft model predicts 2-3 tokens ahead accurately enough to justify the extra compute. 
Going higher than 3 showed diminishing returns.\n- `--no-mmap`\n\nIf you shelved llama.cpp for multi-GPU inference because vLLM was faster, especially if you're forced into lower-quality quants on vLLM to get acceptable speed, `b9455` with `-sm tensor` is worth a retest. The gap is gone.\n\nFull credit to the original Reddit post on r/LocalLLaMA for the benchmarks. What are you seeing with tensor-split on your hardware?", "url": "https://wpnews.pro/news/llama-cpp-b9455-finally-caught-vllm-70t-s-on-2x3090-qwen-27b-uq8", "canonical_source": "https://dev.to/yiqinumber1/llamacpp-b9455-finally-caught-vllm-70ts-on-2x3090-qwen-27b-uq8-1m74", "published_at": "2026-06-03 06:03:19+00:00", "updated_at": "2026-06-03 06:11:18.436642+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-infrastructure", "ai-tools"], "entities": ["llama.cpp", "vLLM", "Reddit", "Unsloth", "Qwen", "RTX 3090", "GGUF", "UD-Q8"], "alternates": {"html": "https://wpnews.pro/news/llama-cpp-b9455-finally-caught-vllm-70t-s-on-2x3090-qwen-27b-uq8", "markdown": "https://wpnews.pro/news/llama-cpp-b9455-finally-caught-vllm-70t-s-on-2x3090-qwen-27b-uq8.md", "text": "https://wpnews.pro/news/llama-cpp-b9455-finally-caught-vllm-70t-s-on-2x3090-qwen-27b-uq8.txt", "jsonld": "https://wpnews.pro/news/llama-cpp-b9455-finally-caught-vllm-70t-s-on-2x3090-qwen-27b-uq8.jsonld"}}