# Why your quantized LLM loses its MTP heads and how to keep them

> Source: <https://dev.to/alanwest/why-your-quantized-llm-loses-its-mtp-heads-and-how-to-keep-them-m7h>
> Published: 2026-05-27 16:00:08+00:00

Last month a teammate pinged me with a classic head-scratcher. He'd taken a base model with multi-token prediction (MTP) heads, run it through a standard quantization pipeline to ship a smaller GGUF for edge inference, and the latency numbers came back *worse* than expected. The model still generated coherent text, but the speculative decoding speedup he'd built his benchmarks around was gone.

We poked around for an hour before the penny dropped. The MTP heads had silently been dropped on the floor during conversion. The base weights survived. The extra prediction heads — the whole reason MTP exists — did not.

If you've worked with models that ship MTP layers (the technique popularized by DeepSeek-V3, where the model predicts several future tokens per step that can then serve as draft tokens), you might have already run into this. The conversion toolchain assumes anything that isn't a vanilla transformer block is dead weight and trims it. Here's why it happens and how to stop it.

Quick refresher so we're on the same page. MTP (multi-token prediction) adds auxiliary heads on top of the base model that each predict a future token at offset +1, +2, +3, etc. At inference time you can use them as a built-in draft model for speculative decoding, which gives you a real throughput win without needing a separate small model.
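To make the draft-then-verify idea concrete, here is a minimal sketch of greedy acceptance. It assumes the MTP heads have proposed some draft tokens and the main model has then produced its own greedy choice for each drafted position (plus one more) in a single verification pass. `accept_draft` and its arguments are hypothetical names for illustration, not part of any library.

```python
def accept_draft(draft_tokens, verified_tokens):
    """Return the tokens to emit after one greedy verification step.

    draft_tokens:    tokens proposed by the MTP heads (hypothetical input)
    verified_tokens: the main model's own greedy choice at each drafted
                     position, followed by one extra token
    """
    accepted = []
    for draft, verified in zip(draft_tokens, verified_tokens):
        if draft != verified:
            # First mismatch: keep the main model's token and stop
            accepted.append(verified)
            return accepted
        accepted.append(draft)
    # Every draft token matched, so the extra verified token is usable too
    n = len(draft_tokens)
    accepted.extend(verified_tokens[n:n + 1])
    return accepted
```

The more draft tokens survive verification, the fewer main-model passes you need per generated token. If the heads are missing, nothing is drafted and you are back to one pass per token.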

The key thing: these heads are **architecturally distinct** from the regular `lm_head`. They live in their own module tree, often named something like `model.mtp.layers.0`, `model.mtp.layers.1`, and so on. They reference shared embeddings but have their own normalization, attention, and projection weights.

That naming convention is exactly what trips up the tooling.

Most quantization toolchains weren't designed with MTP in mind. They walk the state dict and apply transformations based on regex matches against expected layer names. Anything that doesn't match is either:

When I dug into the llama.cpp conversion script for the project, the relevant logic was essentially this pattern:

``` python
# Simplified version of what most converters do.
# `state_dict` and `write_quantized` stand in for the converter's own objects.
import logging

logger = logging.getLogger(__name__)
KNOWN_PREFIXES = ("model.layers.", "model.embed_tokens.", "model.norm.", "lm_head.")

for name, tensor in state_dict.items():
    if not name.startswith(KNOWN_PREFIXES):
        # MTP heads land here and get skipped
        logger.debug("skipping unknown tensor: %s", name)
        continue
    write_quantized(name, tensor)
```

The `logger.debug` is the killer. Unless you run conversion with debug logging on, you never see the skip messages. The file converts "successfully" and you walk away thinking everything's fine.
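One way to make that failure loud is to collect the skipped names and refuse to continue if any of them look like MTP tensors. This is a sketch against the simplified pattern above, not against the real llama.cpp script; `partition_tensor_names` and `assert_no_mtp_skipped` are hypothetical helpers, and the `"mtp"` substring test is an assumption about your model's naming.

```python
KNOWN_PREFIXES = ("model.layers.", "model.embed_tokens.", "model.norm.", "lm_head.")

def partition_tensor_names(names, known_prefixes=KNOWN_PREFIXES):
    """Split tensor names into (kept, skipped) using the same prefix test."""
    kept, skipped = [], []
    for name in names:
        if name.startswith(known_prefixes):
            kept.append(name)
        else:
            skipped.append(name)
    return kept, skipped

def assert_no_mtp_skipped(skipped):
    """Raise instead of logging at debug level when MTP tensors are dropped."""
    lost = [name for name in skipped if "mtp" in name.lower()]
    if lost:
        raise RuntimeError(
            f"{len(lost)} MTP tensor(s) would be skipped, e.g. {lost[0]}"
        )
```

Running this over the checkpoint's tensor names before conversion turns a silent skip into a hard stop.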

GPTQ-style quantizers have a related but different failure mode. They calibrate against forward passes through the model, and if your calibration code only exercises the main `lm_head` path, the MTP heads never see calibration data. Even if the weights are preserved, the heads end up quantized without meaningful calibration statistics, and their predictions can degrade to the point of being useless as drafts.

Here's the workflow I now use whenever I touch a model with MTP heads.

Before any conversion, dump the full state dict and grep for MTP-related modules. This sets your baseline.

``` python
from safetensors import safe_open

mtp_tensors = []
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        # Adjust prefix to whatever your model uses
        if "mtp" in key.lower() or "multi_token" in key.lower():
            mtp_tensors.append((key, f.get_slice(key).get_shape()))

for name, shape in mtp_tensors:
    print(f"{name}: {shape}")

print(f"\nTotal MTP tensors: {len(mtp_tensors)}")
```

Save this output. You'll diff against it after every conversion step.
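A minimal sketch of that save-and-diff step, assuming you store the inventory as a JSON mapping of tensor name to shape. The `save_inventory` and `diff_inventory` helpers are hypothetical names, not part of safetensors.

```python
import json

def save_inventory(tensors, path):
    """Write [(name, shape), ...] to disk as a {name: shape} JSON object."""
    with open(path, "w") as fh:
        json.dump({name: list(shape) for name, shape in tensors}, fh, indent=2)

def diff_inventory(baseline, current):
    """Compare two {name: shape} mappings.

    Returns (missing, reshaped): names absent from `current`, and names
    present in both whose shapes differ.
    """
    missing = sorted(set(baseline) - set(current))
    reshaped = sorted(
        name
        for name in baseline.keys() & current.keys()
        if list(baseline[name]) != list(current[name])
    )
    return missing, reshaped
```

If a conversion step renames tensors, apply the same renaming to the baseline keys before diffing, otherwise every MTP tensor will show up as missing.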

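For llama.cpp style converters, you need to extend the known prefix list and add a mapping rule for the MTP heads. The clean way is to subclass or monkey-patch rather than editing the upstream script directly. The class and method names below may differ across llama.cpp versions, so check them against your checkout: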

``` python
from convert_hf_to_gguf import Model

class MTPAwareModel(Model):
    def map_tensor_name(self, name: str) -> str:
        # Handle MTP heads explicitly before falling through
        if name.startswith("model.mtp."):
            # Preserve the layer index and submodule path
            # Output name needs to match what your loader expects
            return name.replace("model.mtp.", "mtp.")
        return super().map_tensor_name(name)

    def modify_tensors(self, data, name, bid):
        # Skip the parent class's filter for MTP layers
        if "mtp" in name:
            return [(self.map_tensor_name(name), data)]
        return super().modify_tensors(data, name, bid)
```

The critical bit is overriding `modify_tensors`, since the default implementation has the silent skip we saw earlier.

If you're using GPTQ-style quantization, your calibration loop needs to actually hit the MTP heads. The default `model(input_ids)` forward pass only routes through the main LM head. You need to force the MTP heads to see activations:

``` python
def calibration_forward(model, batch):
    # Standard forward populates main path activations
    outputs = model(**batch, output_hidden_states=True)

    # Manually invoke MTP heads using the final hidden state
    # This ensures each head gets calibration statistics
    hidden = outputs.hidden_states[-1]
    for i, head in enumerate(model.mtp.layers):
        # Shift input so head i predicts token at position +i+1
        shifted = hidden[:, :-(i + 1), :]
        _ = head(shifted)

    return outputs
```

Without this, your MTP heads quantize to garbage even though the file looks complete.
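The slicing in `calibration_forward` is easy to get off by one, so here is the same alignment on a plain Python list. It assumes, as the comment in that function does, that head `i` at position `p` is paired with the token at position `p + i + 1`. `aligned_pairs` is a hypothetical helper, and real MTP modules may take additional inputs beyond the hidden state.

```python
def aligned_pairs(tokens, head_index):
    """Pair each usable input position with the token that head predicts.

    Mirrors hidden[:, :-(head_index + 1), :]: the last head_index + 1
    positions are dropped because their target falls outside the sequence.
    """
    offset = head_index + 1
    inputs = tokens[:-offset]   # positions that still have a target
    targets = tokens[offset:]   # the token `offset` steps ahead
    return list(zip(inputs, targets))
```

Larger offsets leave fewer usable positions, so very short calibration sequences give the later heads little or no data.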

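Re-run the inventory against the converted output. The MTP tensor count should match the baseline, allowing for any renaming your mapping applied. The inventory script above reads safetensors, so if you went GGUF, dump the file's metadata and tensor list instead: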

```
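# llama.cpp's gguf-py package provides a dump tool; `pip install gguf` should
# put `gguf-dump` on your PATH (the script's location in the repo varies by version)
gguf-dump model-quantized.gguf | grep -i mtp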
```
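If you would rather do the check from Python, the `gguf` package (maintained in the llama.cpp repository under `gguf-py`) provides a `GGUFReader` class. The sketch below assumes the MTP tensors were written with an `mtp.` prefix, as in the subclass above; adjust the prefix to whatever your mapping produces. `gguf_tensor_names` and `count_mtp` are hypothetical helper names.

```python
def gguf_tensor_names(path):
    """List tensor names in a GGUF file (requires `pip install gguf`)."""
    # Imported lazily so count_mtp below stays dependency-free
    from gguf import GGUFReader

    reader = GGUFReader(path)
    return [tensor.name for tensor in reader.tensors]

def count_mtp(names, prefix="mtp."):
    """Count tensors whose name starts with the assumed MTP prefix."""
    return sum(1 for name in names if name.startswith(prefix))

# Example usage; expected_mtp_count is a placeholder for the number
# from your pre-conversion inventory:
# names = gguf_tensor_names("model-quantized.gguf")
# assert count_mtp(names) == expected_mtp_count
```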

Then run a quick speculative decoding sanity check. If the MTP heads are intact and properly calibrated, the speedup you get from drafting on the quantized model should match, or come very close to, the speedup ratio of the unquantized baseline.
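A sketch of that comparison, assuming you have measured tokens per second with and without MTP drafting for both the unquantized and the quantized model. The function names and the 10% tolerance are illustrative choices, not values taken from any benchmark.

```python
def speedup(tps_with_mtp, tps_without_mtp):
    """Speculative decoding speedup as a ratio of throughputs."""
    if tps_without_mtp <= 0:
        raise ValueError("throughput without MTP must be positive")
    return tps_with_mtp / tps_without_mtp

def speedup_preserved(baseline_ratio, quantized_ratio, tolerance=0.10):
    """True if the quantized model keeps the baseline speedup within tolerance."""
    return quantized_ratio >= baseline_ratio * (1.0 - tolerance)
```

A quantized ratio near 1.0 means drafting is contributing nothing, which is the symptom that started this whole investigation.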

A few habits that have saved me repeated pain:

MTP is one of those features where the failure mode is invisible until you measure the thing the feature was supposed to improve. Treat the conversion pipeline as untrusted by default, and you'll avoid burning an afternoon on it like we did.
