# Qwen3.6 MTP in llama.cpp: 27B Model Now 1.7x Faster

> Source: <https://byteiota.com/qwen3-6-mtp-in-llama-cpp-27b-model-now-1-7x-faster/>
> Published: 2026-06-30 11:14:54+00:00

On May 16, 2026, llama.cpp merged Multi-Token Prediction (MTP) support into its mainline build — and the reaction on Hacker News said everything: “46 t/s with Qwen3.6 27B Q8… this is insane, 250% faster than the original speed… there was no GPU I could upgrade to get that kind of boost.” MTP is speculative decoding baked directly into Qwen3.6’s model weights, enabled with a single flag. The result is 1.7x to 2.4x faster local inference with zero accuracy loss and no extra model download required.

If you run Qwen3.6 27B locally — or have been waiting for local AI to feel genuinely fast — this is the most impactful performance upgrade available right now. No new hardware. No bigger model. Just a flag.

## What MTP Is (And Why There Is No Extra Download)

Speculative decoding is not a new idea, but traditional implementations required a second small “draft model” — a separate download you had to sync with your main model’s quantization and manage in memory. MTP eliminates all of that. The draft head is trained alongside Qwen3.6’s main weights and lives inside the same GGUF file. The overhead is roughly 1 GB of extra VRAM.

At inference time, the MTP head proposes several candidate next tokens — say, tokens 2, 3, 4, and 5. The main model then verifies all four in a single forward pass. If tokens 2 through 4 are accepted, you get three tokens for the cost of one forward pass. The output is mathematically identical to standard generation; nothing about the model’s behavior changes. Rejection just means the main model generates the correct token via the standard path, and decoding continues as normal.
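
The draft-then-verify loop described above can be sketched with toy stand-ins. This is a minimal illustration of greedy speculative decoding in general, not llama.cpp's implementation: `toy_target`, `toy_draft`, and `speculative_round` are hypothetical names, and the "models" are simple arithmetic rules. In this sketch the verification pass also supplies the main model's own token at the first mismatch, which is why the output cannot drift from standard decoding.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_round(ctx: List[int], draft: NextToken, target: NextToken,
                      n_draft: int) -> List[int]:
    """Return the tokens produced by one draft-then-verify round."""
    # Draft phase: the cheap head guesses n_draft tokens, one after another.
    proposed: List[int] = []
    for _ in range(n_draft):
        proposed.append(draft(ctx + proposed))

    # Verify phase: a real engine scores every position in one batched
    # forward pass; this toy loops over positions to stay readable.
    emitted: List[int] = []
    for guess in proposed:
        correct = target(ctx + emitted)
        emitted.append(correct)  # always keep the main model's token
        if guess != correct:
            return emitted       # first mismatch: discard the remaining guesses
    # Every guess matched, so the same pass also yields one extra token.
    emitted.append(target(ctx + emitted))
    return emitted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             n_new: int, n_draft: int) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return how many verify rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_round(out, draft, target, n_draft))
        rounds += 1
    return out[: len(prompt) + n_new], rounds


# Hypothetical toy models: the "main model" follows a fixed rule, and the
# "draft head" agrees with it except right after token 5.
def toy_target(ctx: List[int]) -> int:
    return (3 * ctx[-1] + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    return (guess + 1) % 7 if ctx[-1] == 5 else guess


baseline, baseline_rounds = generate([1], toy_draft, toy_target, n_new=12, n_draft=0)
fast, fast_rounds = generate([1], toy_draft, toy_target, n_new=12, n_draft=2)
print(baseline == fast, baseline_rounds, fast_rounds)
```

With `n_draft=0` the loop degenerates to ordinary one-token-per-pass decoding, which makes it a convenient baseline for comparing round counts.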

The acceptance rate — and thus the actual speedup — depends on how predictable the output is. Coding tasks, structured output, and repetitive patterns accept more tokens. Highly creative outputs accept fewer. In practice, most developer workloads lean toward the former, which is exactly why the gains on [Qwen3.6](https://github.com/QwenLM/Qwen3.6), a model optimized for agentic coding, are as large as they are.
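
To make the link between acceptance rate and speedup concrete, here is a small sketch under a simplifying assumption that is not from the article: each drafted token is accepted independently with the same probability. Under that assumption, a verification pass yields the accepted prefix plus one token from the main model, so the expected count is a short geometric sum. The function name is hypothetical.

```python
def expected_tokens_per_pass(accept_prob: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of the n_draft guesses is accepted independently with
    probability accept_prob, and that the pass always contributes one token
    from the main model (a correction or a bonus token).
    """
    # 1 + p + p^2 + ... + p^n_draft
    return sum(accept_prob ** k for k in range(n_draft + 1))


# Compare a predictable workload with a less predictable one (made-up rates).
for label, p in [("predictable", 0.8), ("creative", 0.4)]:
    print(label, round(expected_tokens_per_pass(p, n_draft=2), 2))
```

Real acceptance rates vary by position and by prompt, so treat this as an upper-level intuition rather than a prediction for any particular model.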

## What Developers Are Actually Seeing: MTP Benchmarks

Benchmarks from multiple independent testers across different hardware tell a consistent story. According to [Startup Fortune’s analysis of the llama.cpp MTP merge](https://startupfortune.com/llamacpp-adds-multi-token-prediction-and-doubles-qwen36-27b-throughput-for-local-inference/), the gains are real and hardware-agnostic:

| Hardware | Without MTP | With MTP | Speedup |
|---|---|---|---|
| RTX 3090 (Q6_K) | 38 t/s | 65 t/s | 1.71x |
| AMD Strix Halo (Q8_0) | 7.4 t/s | 18.1 t/s | 2.44x |
| RTX A6000 | ~20 t/s | ~55 t/s | ~2.75x |
| DGX Spark GB10 (Q4_K_M) | 13.1 t/s | 27.6 t/s | 2.1x |

Notably, the AMD Strix Halo result (2.44x) is proportionally larger than the RTX 3090 and DGX Spark gains, although the approximate RTX A6000 figure in the table is higher still. That stands out because CUDA-optimized inference stacks have historically favored NVIDIA, and MTP narrows the gap in a meaningful way. The [NVIDIA Developer Forum thread on DGX Spark results](https://forums.developer.nvidia.com/t/mtp-llama-cpp-a-look-at-qwen3-6-27b/370298) is worth reading if you’re on high-VRAM hardware. Per Unsloth’s own testing, RTX 6000 with MTP now hits 160 tokens per second on Qwen3.6 27B.
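
The speedup column in the table is simply the with-MTP rate divided by the without-MTP rate. A short sketch that recomputes it from the table's own figures (the RTX A6000 row uses the approximate values as given):

```python
# (tokens/s without MTP, tokens/s with MTP), copied from the table above.
runs = {
    "RTX 3090 (Q6_K)": (38.0, 65.0),
    "AMD Strix Halo (Q8_0)": (7.4, 18.1),
    "RTX A6000": (20.0, 55.0),  # approximate figures in the source
    "DGX Spark GB10 (Q4_K_M)": (13.1, 27.6),
}


def speedup(without_mtp: float, with_mtp: float) -> float:
    """Ratio of throughput with MTP to throughput without it."""
    return with_mtp / without_mtp


speedups = {name: speedup(*rates) for name, rates in runs.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```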


## How to Enable Qwen3.6 MTP in llama.cpp Right Now

You need llama.cpp build 9200 or higher (the MTP PR merged May 16) and an MTP-labeled GGUF; standard Qwen3.6 GGUFs do not include the MTP head. [Unsloth’s Qwen3.6 GGUFs](https://unsloth.ai/docs/models/qwen3.6) exited experimental status in late May and are the recommended starting point. On macOS, `brew upgrade llama.cpp` is sufficient. On Linux, build from source with CUDA enabled:

```
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --target llama-cli
```
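
Before going further, it can help to confirm that the installed binary meets the build 9200 requirement. The sketch below assumes that `llama-cli --version` prints a line of the form `version: <build> (<commit>)`; that output format is an assumption here, not something the article states, and the helper names and sample strings are hypothetical.

```python
import re
import subprocess
from typing import Optional

MIN_MTP_BUILD = 9200  # minimum build number quoted in the article


def parse_build_number(version_text: str) -> Optional[int]:
    """Extract the build number from text like 'version: 9215 (0123abc)'."""
    match = re.search(r"version:\s*(\d+)", version_text)
    return int(match.group(1)) if match else None


def supports_mtp(version_text: str) -> bool:
    """True when a build number is found and it is at least MIN_MTP_BUILD."""
    build = parse_build_number(version_text)
    return build is not None and build >= MIN_MTP_BUILD


def installed_version_text(binary: str = "llama-cli") -> str:
    """Run the binary with --version and return everything it printed."""
    result = subprocess.run([binary, "--version"],
                            capture_output=True, text=True, check=False)
    # Combine both streams, since the version line may go to either one.
    return result.stdout + result.stderr


# Made-up sample strings, used so the sketch runs without llama.cpp installed.
print(supports_mtp("version: 9215 (0123abc)"))
print(supports_mtp("version: 9100 (deadbee)"))
```

On a machine with llama.cpp on the `PATH`, `supports_mtp(installed_version_text())` would perform the real check.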

Once you have a current build, enabling MTP takes two flags:

```
llama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
  --temp 1.0 --top-p 0.95 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2
```

The flag was renamed from `--draft-mtp` to `--spec-type draft-mtp` in the stable merge, so older tutorials have the wrong syntax. Start with `--spec-draft-n-max 2`; community testing shows this is optimal for most Q4 and Q6 quantizations. Pushing to 3 or 4 increases rejection rates on consumer GPUs and often reduces net throughput. One additional gotcha: avoid CUDA 13.2, which causes output corruption on Qwen3.6. CUDA 13.1 or 13.3+ are both fine.
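
One way to see why a longer draft can lower net throughput is a toy cost model. Everything in it is an assumption for illustration: drafted tokens are accepted independently with a fixed probability, and each extra drafted token adds a fixed fraction of a normal forward pass in drafting and verification cost. The numbers below are made up and are not measurements.

```python
from typing import Iterable


def relative_throughput(accept_prob: float, n_draft: int,
                        cost_per_draft_token: float) -> float:
    """Throughput relative to plain decoding under the toy cost model."""
    # Expected tokens per round: accepted prefix plus one main-model token.
    tokens = sum(accept_prob ** k for k in range(n_draft + 1))
    # Cost per round, in units of one ordinary forward pass.
    cost = 1.0 + n_draft * cost_per_draft_token
    return tokens / cost


def best_draft_length(accept_prob: float, cost_per_draft_token: float,
                      candidates: Iterable[int] = (1, 2, 3, 4)) -> int:
    """Pick the draft length with the highest modelled throughput."""
    return max(candidates,
               key=lambda n: relative_throughput(accept_prob, n, cost_per_draft_token))


# Hypothetical inputs: 60% acceptance, each drafted token costs 20% of a pass.
for n in (1, 2, 3, 4):
    print(n, round(relative_throughput(0.6, n, 0.2), 3))
print("best:", best_draft_length(0.6, 0.2))
```

With these particular made-up inputs the model peaks at a draft length of 2; with higher acceptance or cheaper drafting the peak moves to longer drafts, which is why measuring on your own hardware matters more than any fixed rule.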

VRAM requirements with MTP enabled: 27B at Q4_K_M needs approximately 19 GB (18 GB base plus 1 GB for the MTP head). The 35B-A3B MoE variant needs approximately 24 GB at Q4_K_M and offers even higher throughput — 240 tokens per second on RTX 6000 per Unsloth’s benchmarks — if you have the headroom.

## When MTP Helps — and When It Actively Hurts

MTP is a tool for individual developers and low-concurrency workloads. It is not a solution for shared inference servers. The DGX Spark result is instructive: at single-user concurrency, MTP boosted throughput from 13.1 to 27.6 tokens per second — a 2.1x gain. At four concurrent users, throughput with MTP dropped to 29.9 t/s versus 41.5 t/s without it. MTP actively hurt multi-user performance. If you run a team inference server, benchmark your actual concurrency levels before enabling it.
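
The "benchmark before enabling" advice can be reduced to a simple comparison per concurrency level. The sketch below uses the DGX Spark figures quoted above; the `Measurement` class and `should_enable_mtp` helper are hypothetical names, and how you collect the numbers for your own server is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Measurement:
    concurrency: int        # number of simultaneous users
    tps_without_mtp: float  # tokens/s with MTP disabled
    tps_with_mtp: float     # tokens/s with MTP enabled

    @property
    def gain(self) -> float:
        """Throughput with MTP relative to throughput without it."""
        return self.tps_with_mtp / self.tps_without_mtp


def should_enable_mtp(m: Measurement, min_gain: float = 1.0) -> bool:
    """Enable MTP only if it beats the non-MTP run by more than min_gain."""
    return m.gain > min_gain


# DGX Spark figures quoted in this section.
dgx_spark: List[Measurement] = [
    Measurement(concurrency=1, tps_without_mtp=13.1, tps_with_mtp=27.6),
    Measurement(concurrency=4, tps_without_mtp=41.5, tps_with_mtp=29.9),
]

for m in dgx_spark:
    print(m.concurrency, f"{m.gain:.2f}x", should_enable_mtp(m))
```

Raising `min_gain` above 1.0 is one way to require a margin that justifies the extra VRAM before turning the feature on.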

There is also a model compatibility requirement. MTP only works with models trained with MTP heads. Qwen3.6 has one. So do Qwen3.5 (7B and larger), DeepSeek V3 and R1, and Gemma 4 26B-A4B. Llama 3, Mistral, and older Gemma versions do not — you cannot retrofit an existing GGUF with an MTP head. Finally, Ollama currently cannot run Qwen3.6 GGUFs at all: the vision components use separate mmproj files that Ollama does not handle. Use llama.cpp directly or Unsloth Studio instead. The community is tracking Ollama MTP support, but there is no ETA.

## Key Takeaways

- llama.cpp merged MTP support on May 16, 2026 — it is in mainline now, no custom branches required
- Qwen3.6 27B sees 1.7x to 2.4x faster inference with `--spec-type draft-mtp --spec-draft-n-max 2` and zero accuracy change
- No separate draft model needed: the MTP head lives in the GGUF file for approximately 1 GB extra VRAM
- MTP is best for single-user workloads; avoid on shared inference servers at concurrency 4+ until you benchmark
- Ollama users cannot use this yet; use llama.cpp directly or [Unsloth Studio](https://unsloth.ai/docs/models/qwen3.6) instead