Why your quantized LLM loses its MTP heads and how to keep them

A developer discovered that standard quantization pipelines for large language models silently discard multi-token prediction (MTP) heads, causing speculative decoding speedups to vanish despite the base model appearing to function normally. The conversion toolchains treat any module outside vanilla transformer blocks as dead weight, dropping MTP layers named with non-standard prefixes like "model.mtp.layers" without warning unless debug logging is enabled. The engineer documented a fix involving extending the known tensor prefix list and adding explicit mapping rules for MTP heads before quantization.

Last month a teammate pinged me with a classic head-scratcher. He'd taken a base model with multi-token prediction (MTP) heads, run it through a standard quantization pipeline to ship a smaller GGUF for edge inference, and the latency numbers came back worse than expected. The model still generated coherent text, but the speculative decoding speedup he'd built his benchmarks around was gone.

We poked around for an hour before the penny dropped. The MTP heads had silently been dropped on the floor during conversion. The base weights survived. The extra prediction heads, the whole reason MTP exists, did not.

If you've worked with models that ship MTP layers (the technique popularized by DeepSeek-V3, where the model predicts the next N tokens in parallel as draft tokens), you might have already run into this. The conversion toolchain assumes anything that isn't a vanilla transformer block is dead weight and trims it. Here's why it happens and how to stop it.

Quick refresher so we're on the same page. MTP (multi-token prediction) adds auxiliary heads on top of the base model that each predict a future token at offset +1, +2, +3, etc. At inference time you can use them as a built-in draft model for speculative decoding, which gives you a real throughput win without needing a separate small model.

The key thing: these heads are architecturally distinct from the regular `lm_head`. They live in their own module tree, often named something like `model.mtp.layers.0`, `model.mtp.layers.1`, and so on. They reference shared embeddings but have their own normalization, attention, and projection weights.

That naming convention is exactly what trips up the tooling. Most quantization toolchains weren't designed with MTP in mind. They walk the state dict and apply transformations based on regex matches against expected layer names.
Anything that doesn't match is either:

When I dug into the llama.cpp conversion script for the project, the relevant logic was essentially this pattern:

```python
# Simplified version of what most converters do
KNOWN_PREFIXES = ("model.layers.", "model.embed_tokens.", "model.norm.", "lm_head.")

for name, tensor in state_dict.items():
    if not name.startswith(KNOWN_PREFIXES):
        # MTP heads land here and get skipped
        logger.debug(f"skipping unknown tensor: {name}")
        continue
    write_quantized(name, tensor)
```

The `logger.debug` is the killer. Unless you run conversion with debug logging on, you never see the skip messages. The file converts "successfully" and you walk away thinking everything's fine.

GPTQ-style quantizers have a related but different failure mode. They calibrate against forward passes through the model, and if your calibration code only exercises the main `lm_head` path, the MTP heads never see calibration data. Even if the weights are preserved, the resulting quantized heads are essentially random.

Here's the workflow I now use whenever I touch a model with MTP heads.

Before any conversion, dump the full state dict and grep for MTP-related modules. This sets your baseline.

```python
from safetensors import safe_open

mtp_tensors = []
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        # Adjust prefix to whatever your model uses
        if "mtp" in key.lower() or "multi_token" in key.lower():
            mtp_tensors.append((key, f.get_slice(key).get_shape()))

for name, shape in mtp_tensors:
    print(f"{name}: {shape}")
print(f"\nTotal MTP tensors: {len(mtp_tensors)}")
```

Save this output. You'll diff against it after every conversion step.

For llama.cpp style converters, you need to extend the known prefix list and add a mapping rule for the MTP heads.
The clean way is to subclass or monkey-patch rather than editing the upstream script directly:

```python
from convert_hf_to_gguf import Model


class MTPAwareModel(Model):
    def map_tensor_name(self, name: str) -> str:
        # Handle MTP heads explicitly before falling through
        if name.startswith("model.mtp."):
            # Preserve the layer index and submodule path
            # Output name needs to match what your loader expects
            return name.replace("model.mtp.", "mtp.")
        return super().map_tensor_name(name)

    def modify_tensors(self, data, name, bid):
        # Skip the parent class's filter for MTP layers
        if "mtp" in name:
            # Return a list of (name, tensor) pairs, like the parent does;
            # check this against the version of the script you are using
            return [(self.map_tensor_name(name), data)]
        return super().modify_tensors(data, name, bid)
```

The critical bit is overriding `modify_tensors`: the default implementation has the silent skip we saw earlier.

If you're using GPTQ-style quantization, your calibration loop needs to actually hit the MTP heads. The default `model(input_ids)` forward pass only routes through the main LM head. You need to force the MTP heads to see activations:

```python
def calibration_forward(model, batch):
    # Standard forward populates main path activations
    outputs = model(batch, output_hidden_states=True)

    # Manually invoke MTP heads using the final hidden state
    # This ensures each head gets calibration statistics
    hidden = outputs.hidden_states[-1]
    for i, head in enumerate(model.mtp.layers):
        # Shift input so head i predicts token at position +i+1
        shifted = hidden[:, :-(i + 1), :]
        _ = head(shifted)

    return outputs
```

Without this, your MTP heads quantize to garbage even though the file looks complete.

Re-run the inventory script against the converted file. The tensor count should match. If you went GGUF, you can also dump metadata:

```bash
# llama.cpp ships a metadata inspection tool
./gguf-dump model-quantized.gguf | grep -i mtp
```

Then run a quick speculative decoding sanity check. If the MTP heads are intact and properly calibrated, your tokens-per-second numbers should match, or come very close to, the unquantized baseline's speedup ratio.
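To make the "diff against the baseline" step mechanical, something like the helper below can work. Treat it as a rough sketch rather than part of any tool: `diff_inventories` and its `rename` callback are names made up for illustration, the inventories are assumed to be plain dicts of tensor name to shape (however you dumped them), and the `model.mtp.` to `mtp.` rename is only an assumption about how the converter maps names. It compares names only, since shape conventions can differ between file formats.

```python
def diff_inventories(before, after, rename=lambda name: name):
    """Return source tensor names that have no counterpart after conversion.

    before, after: dicts mapping tensor name -> shape (shapes are not compared).
    rename: hypothetical callback mapping a source name to its converted name.
    """
    return [name for name in before if rename(name) not in after]


# Toy inventories standing in for the real dumps.
before = {
    "model.mtp.layers.0.norm.weight": (4096,),
    "model.mtp.layers.0.proj.weight": (4096, 4096),
}
after = {
    "mtp.layers.0.norm.weight": (4096,),
}

missing = diff_inventories(
    before,
    after,
    # Assumed naming rule; adjust to whatever your converter emits.
    rename=lambda name: name.replace("model.mtp.", "mtp.", 1),
)

for name in missing:
    print(f"MISSING after conversion: {name}")
```

An empty `missing` list means every baseline tensor has a counterpart in the converted file. It says nothing about calibration quality, so the speculative decoding check is still worth running.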
A few habits that have saved me repeated pain:

MTP is one of those features where the failure mode is invisible until you measure the thing the feature was supposed to improve. Treat the conversion pipeline as untrusted by default, and you'll avoid burning an afternoon on it like we did.