{"slug": "why-your-quantized-llm-loses-its-mtp-heads-and-how-to-keep-them", "title": "Why your quantized LLM loses its MTP heads and how to keep them", "summary": "A developer discovered that standard quantization pipelines for large language models silently discard multi-token prediction (MTP) heads, causing speculative decoding speedups to vanish despite the base model appearing to function normally. The conversion toolchains treat any module outside vanilla transformer blocks as dead weight, dropping MTP layers named with non-standard prefixes like \"model.mtp.layers\" without warning unless debug logging is enabled. The engineer documented a fix involving extending the known tensor prefix list and adding explicit mapping rules for MTP heads before quantization.", "body_md": "Last month a teammate pinged me with a classic head-scratcher. He'd taken a base model with multi-token prediction (MTP) heads, run it through a standard quantization pipeline to ship a smaller GGUF for edge inference, and the latency numbers came back *worse* than expected. The model still generated coherent text, but the speculative decoding speedup he'd built his benchmarks around was gone.\n\nWe poked around for an hour before the penny dropped. The MTP heads had been silently dropped on the floor during conversion. The base weights survived. The extra prediction heads, the whole reason MTP exists, did not.\n\nIf you've worked with models that ship MTP layers (the technique popularized by DeepSeek-V3, where the model predicts the next N tokens in parallel as draft tokens), you might have already run into this. The conversion toolchain assumes anything that isn't a vanilla transformer block is dead weight and trims it. Here's why it happens and how to stop it.\n\nQuick refresher so we're on the same page. MTP (multi-token prediction) adds auxiliary heads on top of the base model that each predict a future token at offset +1, +2, +3, etc. 
At inference time you can use them as a built-in draft model for speculative decoding, which gives you a real throughput win without needing a separate small model.\n\nThe key thing: these heads are **architecturally distinct** from the regular `lm_head`. They live in their own module tree, often named something like `model.mtp.layers.0`, `model.mtp.layers.1` and so on. They reference shared embeddings but have their own normalization, attention, and projection weights.\n\nThat naming convention is exactly what trips up the tooling.\n\nMost quantization toolchains weren't designed with MTP in mind. They walk the state dict and apply transformations based on regex matches against expected layer names. Anything that doesn't match is either:\n\nWhen I dug into the llama.cpp conversion script for the project, the relevant logic was essentially this pattern:\n\n``` python\n# Simplified version of what most converters do.\n# `state_dict` and `write_quantized` are placeholders for the converter's own objects.\nimport logging\n\nlogger = logging.getLogger(__name__)\n\nKNOWN_PREFIXES = (\"model.layers.\", \"model.embed_tokens.\", \"model.norm.\", \"lm_head.\")\n\nfor name, tensor in state_dict.items():\n    # str.startswith accepts a tuple and matches any of the prefixes\n    if not name.startswith(KNOWN_PREFIXES):\n        # MTP heads land here and get skipped\n        logger.debug(f\"skipping unknown tensor: {name}\")\n        continue\n    write_quantized(name, tensor)\n```\n\nThe `logger.debug` is the killer. Unless you run conversion with debug logging on, you never see the skip messages. The file converts \"successfully\" and you walk away thinking everything's fine.\n\nGPTQ-style quantizers have a related but different failure mode. They calibrate against forward passes through the model, and if your calibration code only exercises the main `lm_head` path, the MTP heads never see calibration data. Even if the weights are preserved, the quantized heads are chosen without any activation statistics and can be close to useless as a draft model.\n\nHere's the workflow I now use whenever I touch a model with MTP heads.\n\nBefore any conversion, dump the full state dict and grep for MTP-related modules. 
This sets your baseline.\n\n``` python\nfrom safetensors import safe_open\n\nmtp_tensors = []\nwith safe_open(\"model.safetensors\", framework=\"pt\") as f:\n    for key in f.keys():\n        # Adjust these substrings to whatever naming your model uses\n        if \"mtp\" in key.lower() or \"multi_token\" in key.lower():\n            # get_slice reads only the header, so no tensor data is loaded\n            mtp_tensors.append((key, f.get_slice(key).get_shape()))\n\nfor name, shape in mtp_tensors:\n    print(f\"{name}: {shape}\")\n\nprint(f\"\\nTotal MTP tensors: {len(mtp_tensors)}\")\n```\n\nSave this output. You'll diff against it after every conversion step.\n\nFor llama.cpp style converters, you need to extend the known prefix list and add a mapping rule for the MTP heads. The clean way is to subclass or monkey-patch rather than editing the upstream script directly:\n\n``` python\n# The base class name differs between llama.cpp versions; adjust the import to your checkout\nfrom convert_hf_to_gguf import Model\n\nMTP_PREFIX = \"model.mtp.\"\n\nclass MTPAwareModel(Model):\n    def map_tensor_name(self, name: str, *args, **kwargs) -> str:\n        # Handle MTP heads explicitly before falling through\n        if name.startswith(MTP_PREFIX):\n            # Preserve the layer index and submodule path\n            # Output name needs to match what your loader expects\n            return \"mtp.\" + name[len(MTP_PREFIX):]\n        # Forward any extra arguments the parent signature accepts\n        return super().map_tensor_name(name, *args, **kwargs)\n\n    def modify_tensors(self, data, name, bid):\n        # Skip the parent class's filter for MTP layers\n        if name.startswith(MTP_PREFIX):\n            return [(self.map_tensor_name(name), data)]\n        return super().modify_tensors(data, name, bid)\n```\n\nThe critical bit is overriding `modify_tensors`: the default implementation has the silent skip we saw earlier.\n\nIf you're using GPTQ-style quantization, your calibration loop needs to actually hit the MTP heads. The default `model(input_ids)` forward pass only routes through the main LM head. 
You need to force the MTP heads to see activations:\n\n``` python\ndef calibration_forward(model, batch):\n    # Standard forward populates main path activations\n    outputs = model(**batch, output_hidden_states=True)\n\n    # Manually invoke MTP heads using the final hidden state\n    # This ensures each head gets calibration statistics\n    hidden = outputs.hidden_states[-1]\n    for i, head in enumerate(model.mtp.layers):\n        # Shift input so head i predicts token at position +i+1\n        shifted = hidden[:, :-(i + 1), :]\n        _ = head(shifted)\n\n    return outputs\n```\n\nWithout this, your MTP heads quantize to garbage even though the file looks complete.\n\nRe-run the inventory script against the converted file. The tensor count should match. If you went GGUF, you can also dump metadata:\n\n```\n# llama.cpp ships a metadata inspection tool\n./gguf-dump model-quantized.gguf | grep -i mtp\n```\n\nThen run a quick speculative decoding sanity check. If the MTP heads are intact and properly calibrated, you should see your tokens-per-second numbers match (or get very close to) the unquantized baseline's speedup ratio.\n\nA few habits that have saved me repeated pain:\n\nMTP is one of those features where the failure mode is invisible until you measure the thing the feature was supposed to improve. 
Treat the conversion pipeline as untrusted by default, and you'll avoid burning an afternoon on it like we did.", "url": "https://wpnews.pro/news/why-your-quantized-llm-loses-its-mtp-heads-and-how-to-keep-them", "canonical_source": "https://dev.to/alanwest/why-your-quantized-llm-loses-its-mtp-heads-and-how-to-keep-them-m7h", "published_at": "2026-05-27 16:00:08+00:00", "updated_at": "2026-05-27 16:11:39.166876+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence", "ai-infrastructure", "mlops"], "entities": ["DeepSeek-V3", "GGUF", "MTP"], "alternates": {"html": "https://wpnews.pro/news/why-your-quantized-llm-loses-its-mtp-heads-and-how-to-keep-them", "markdown": "https://wpnews.pro/news/why-your-quantized-llm-loses-its-mtp-heads-and-how-to-keep-them.md", "text": "https://wpnews.pro/news/why-your-quantized-llm-loses-its-mtp-heads-and-how-to-keep-them.txt", "jsonld": "https://wpnews.pro/news/why-your-quantized-llm-loses-its-mtp-heads-and-how-to-keep-them.jsonld"}}