{"slug": "mtp-benchmark", "title": "MTP benchmark", "summary": "This article presents benchmark results comparing the performance of a Qwen3.6 model running in standard mode versus with Multi-Token Prediction (MTP) enabled. The MTP configuration with a draft of 3 tokens achieved a significantly higher aggregate throughput of 16.8 tokens per second compared to 7.0 tok/s without MTP, while also reducing wall time from 201 seconds to 83.8 seconds.", "body_md": "./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs \"{\\\"preserve_thinking\\\": true}\"\ncode_python pred= 192 draft= 0 acc= 0 rate=n/a tok/s=7.0\ncode_cpp pred= 192 draft= 0 acc= 0 rate=n/a tok/s=7.3\nexplain_concept pred= 192 draft= 0 acc= 0 rate=n/a tok/s=7.3\nsummarize pred= 53 draft= 0 acc= 0 rate=n/a tok/s=7.1\nqa_factual pred= 177 draft= 0 acc= 0 rate=n/a tok/s=7.0\ntranslation pred= 22 draft= 0 acc= 0 rate=n/a tok/s=7.7\ncreative_short pred= 192 draft= 0 acc= 0 rate=n/a tok/s=7.1\nstepwise_math pred= 192 draft= 0 acc= 0 rate=n/a tok/s=7.2\nlong_code_review pred= 192 draft= 0 acc= 0 rate=n/a tok/s=7.0\nAggregate: {\n\"n_requests\": 9,\n\"total_predicted\": 1404,\n\"total_draft\": 0,\n\"total_draft_accepted\": 0,\n\"aggregate_accept_rate\": null,\n\"wall_s_total\": 201.07\n}\n./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs \"{\\\"preserve_thinking\\\": true}\" --spec-type mtp --spec-draft-n-max 3\ncode_python pred= 192 draft= 153 acc= 139 rate=0.908 tok/s=21.6\ncode_cpp pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=18.7\nexplain_concept pred= 192 draft= 191 acc= 126 rate=0.660 tok/s=16.3\nsummarize pred= 55 draft= 51 acc= 37 rate=0.726 tok/s=17.9\nqa_factual pred= 177 draft= 174 acc= 118 rate=0.678 tok/s=16.5\ntranslation pred= 22 draft= 24 acc= 13 rate=0.542 tok/s=13.9\ncreative_short pred= 192 draft= 200 acc= 123 rate=0.615 tok/s=15.8\nstepwise_math pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=19.3\nlong_code_review pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=18.0\nAggregate: {\n\"n_requests\": 9,\n\"total_predicted\": 1406,\n\"total_draft\": 1319,\n\"total_draft_accepted\": 952,\n\"aggregate_accept_rate\": 0.7218,\n\"wall_s_total\": 83.8\n}\n./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs \"{\\\"preserve_thinking\\\": true}\" --spec-type mtp --spec-draft-n-max 2\ncode_python pred= 192 draft= 134 acc= 123 rate=0.918 tok/s=17.4\ncode_cpp pred= 192 draft= 145 acc= 118 rate=0.814 tok/s=16.5\nexplain_concept pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=16.1\nsummarize pred= 55 draft= 44 acc= 32 rate=0.727 tok/s=15.6\nqa_factual pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=18.2\ntranslation pred= 22 draft= 18 acc= 12 rate=0.667 tok/s=15.2\ncreative_short pred= 192 draft= 149 acc= 116 rate=0.778 tok/s=16.1\nstepwise_math pred= 192 draft= 139 acc= 121 rate=0.871 tok/s=17.2\nlong_code_review pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=15.6\nAggregate: {\n\"n_requests\": 9,\n\"total_predicted\": 1421,\n\"total_draft\": 1062,\n\"total_draft_accepted\": 877,\n\"aggregate_accept_rate\": 0.8258,\n\"wall_s_total\": 90.44\n}\nllama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs \"{\\\"preserve_thinking\\\": true}\"\ncode_python pred= 192 draft= 188 acc= 156 rate=0.830 tok/s=26.4\ncode_cpp pred= 192 draft= 201 acc= 126 rate=0.627 tok/s=16.8\nexplain_concept pred= 192 draft= 263 acc= 112 rate=0.426 tok/s=12.7\nsummarize pred= 57 draft= 63 acc= 39 rate=0.619 tok/s=16.9\nqa_factual pred= 192 draft= 178 acc= 177 rate=0.994 tok/s=47.7\ntranslation pred= 23 draft= 18 acc= 15 rate=0.833 tok/s=18.7\ncreative_short pred= 192 draft= 189 acc= 120 rate=0.635 tok/s=15.4\nstepwise_math pred= 192 draft= 190 acc= 148 rate=0.779 tok/s=22.3\nlong_code_review pred= 192 draft= 207 acc= 120 rate=0.580 tok/s=14.5\nAggregate: {\n\"n_requests\": 9,\n\"total_predicted\": 1424,\n\"total_draft\": 1497,\n\"total_draft_accepted\": 1013,\n\"aggregate_accept_rate\": 0.6767,\n\"wall_s_total\": 81.39\n}\nllama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs \"{\\\"preserve_thinking\\\": true}\"\ncode_python pred= 192 draft= 174 acc= 159 rate=0.914 tok/s=27.2\ncode_cpp pred= 192 draft= 138 acc= 120 rate=0.870 tok/s=15.0\nexplain_concept pred= 192 draft= 170 acc= 101 rate=0.594 tok/s=11.4\nsummarize pred= 55 draft= 48 acc= 36 rate=0.750 tok/s=14.6\nqa_factual pred= 177 draft= 126 acc= 106 rate=0.841 tok/s=13.9\ntranslation pred= 22 draft= 13 acc= 13 rate=1.000 tok/s=16.5\ncreative_short pred= 192 draft= 136 acc= 104 rate=0.765 tok/s=12.8\nstepwise_math pred= 192 draft= 172 acc= 147 rate=0.855 tok/s=22.0\nlong_code_review pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=13.0\nAggregate: {\n\"n_requests\": 9,\n\"total_predicted\": 1406,\n\"total_draft\": 1137,\n\"total_draft_accepted\": 897,\n\"aggregate_accept_rate\": 0.7889,\n\"wall_s_total\": 97.13\n}", "url": "https://wpnews.pro/news/mtp-benchmark", "canonical_source": "https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090", "published_at": "2026-05-04 05:15:44+00:00", "updated_at": "2026-05-22 19:37:36.607039+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence", "open-source", "developer-tools"], "entities": ["llama-server", "Qwen"], "alternates": {"html": "https://wpnews.pro/news/mtp-benchmark", "markdown": "https://wpnews.pro/news/mtp-benchmark.md", "text": "https://wpnews.pro/news/mtp-benchmark.txt", "jsonld": "https://wpnews.pro/news/mtp-benchmark.jsonld"}}