I was curious why MTP affects PP TPS in llama.cpp. My PoC recovers it?

A developer investigating low prompt processing (PP) throughput with Multi-Token Prediction (MTP) in llama.cpp created a proof-of-concept that recovers the overhead by processing only the output row of the last-layer MoE FFN during prefill, instead of the entire ubatch. In the author's benchmark, prompt processing tokens per second (TPS) returned to non-MTP levels, an uplift of 20%, while most of MTP's benefit to token generation (TG) throughput was retained. The PoC, written with AI assistance from GLM 5.1 and reviewed by GLM 5.2, is released under MIT but has not been submitted as a pull request because of llama.cpp's policy against AI-generated code.
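To make the row-selection idea concrete, here is a minimal pure-Python sketch. It is not the PoC's code and uses no llama.cpp APIs: the toy top-1 MoE FFN, the helper names (`matvec`, `moe_ffn_row`, `rand_matrix`), and the tiny sizes are all assumptions made for illustration. It only shows why skipping rows is safe for a position-wise FFN: each token's output depends on that token's input alone, so computing just the output row gives the same values as computing the whole ubatch and then discarding the rest.

```python
import random

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def moe_ffn_row(x, router, experts):
    """Toy top-1 MoE FFN for a single token (hypothetical, for illustration)."""
    scores = matvec(router, x)
    best = max(range(len(scores)), key=scores.__getitem__)  # pick one expert
    w_up, w_down = experts[best]
    hidden = [max(0.0, h) for h in matvec(w_up, x)]  # ReLU activation
    return matvec(w_down, hidden)

def rand_matrix(rng, rows, cols):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d_model, d_ff, n_experts = 8, 4, 6, 3  # toy sizes, not real model dims

router = rand_matrix(rng, n_experts, d_model)
experts = [
    (rand_matrix(rng, d_ff, d_model), rand_matrix(rng, d_model, d_ff))
    for _ in range(n_experts)
]
ubatch = rand_matrix(rng, n_tokens, d_model)  # last-layer inputs, one row per token
output_rows = [n_tokens - 1]                  # only the last token needs logits

# Baseline: run the FFN on every row of the ubatch, then keep the output row.
full = [moe_ffn_row(x, router, experts) for x in ubatch]
full_selected = [full[i] for i in output_rows]

# Row-selected: run the FFN only on the rows that are actually needed.
sparse = [moe_ffn_row(ubatch[i], router, experts) for i in output_rows]

ffn_calls_full = n_tokens
ffn_calls_sparse = len(output_rows)
print(sparse == full_selected, ffn_calls_full, ffn_calls_sparse)
```

In this toy setup the row-selected path makes 1 FFN call instead of 8. With a real ubatch of 512-2048 tokens the last-layer saving is correspondingly larger, although how that translates into end-to-end PP TPS depends on the rest of the graph.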

I've been running Qwen3.6-35B-A3B locally on llama.cpp and noticed that prompt processing (PP) throughput drops noticeably with MTP enabled. I got nerd-sniped. I'm not a C++ dev, I know almost nothing about ML, and I'm only scratching the surface of how LLMs work. What started as curiosity turned into a two-week rabbit hole of experiments and ended with a PoC that fully recovers the MTP PP overhead on GPU, well beyond any expectation I had.

TL;DR: instead of processing the last-layer MoE FFN for every token in the ubatch (usually 512-2048 tokens), this PoC processes only the output row (usually 1 token) during prefill. As a result, PP TPS is back to the same level as with MTP disabled, which in my benchmark was an uplift of 20%, while keeping most of MTP's benefit to TG TPS, even with a slight drop in draft acceptance rate in one of the benchmarks.

More details in the branch README: https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn-skip-catchup/README.md

I worked with GLM 5.1 to write the code, Minimax M3 ran the tests and benchmarks on Modal, and GLM 5.2 reviewed the work. GLM 5.1 is very smart, and GLM 5.2 is capable of spotting deep side effects in the code; no surprise it's at the top. Minimax M2.x were fast but lazy. M3 is a real leap and deserves more attention: it is smart, proactive, follows instructions, and auto-corrects.

I'm not opening a PR to llama.cpp because this is AI-generated code, which goes against their contribution policy, a policy I support. If you know llama.cpp internals, you're invited to take a look at the PoC. I'll be happy to work alongside you to open a PR with a more mature implementation.

This work is released under MIT, same as llama.cpp. Happy to answer questions in the comments.

Comments URL: https://news.ycombinator.com/item?id=48673852