[ARTICLE · art-15461] src=dev.to pub= topic=large-language-models verified=true sentiment=· neutral

Why your quantized LLM loses its MTP heads and how to keep them

A developer discovered that standard quantization pipelines for large language models silently discard multi-token prediction (MTP) heads, causing speculative decoding speedups to vanish despite the base model appearing to function normally. The conversion toolchains treat any module outside vanilla transformer blocks as dead weight, dropping MTP layers named with non-standard prefixes like "model.mtp.layers" without warning unless debug logging is enabled. The engineer documented a fix involving extending the known tensor prefix list and adding explicit mapping rules for MTP heads before quantization.

4 min read · published May 27, 2026

Last month a teammate pinged me with a classic head-scratcher. He'd taken a base model with multi-token prediction (MTP) heads, run it through a standard quantization pipeline to ship a smaller GGUF for edge inference, and the latency numbers came back worse than expected. The model still generated coherent text, but the speculative decoding speedup he'd built his benchmarks around was gone.

We poked around for an hour before the penny dropped. The MTP heads had silently been dropped on the floor during conversion. The base weights survived. The extra prediction heads — the whole reason MTP exists — did not.

If you've worked with models that ship MTP layers (the technique popularized by DeepSeek-V3, where the model predicts the next N tokens in parallel as draft tokens), you might have already run into this. The conversion toolchain assumes anything that isn't a vanilla transformer block is dead weight and trims it. Here's why it happens and how to stop it.

Quick refresher so we're on the same page. MTP (multi-token prediction) adds auxiliary heads on top of the base model that each predict a future token at offset +1, +2, +3, etc. At inference time you can use them as a built-in draft model for speculative decoding, which gives you a real throughput win without needing a separate small model.
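To make the draft-and-verify idea concrete, here is a toy sketch of greedy acceptance: the tokens proposed by the MTP heads are compared with what the base model picks at the same positions, and the longest matching prefix is kept. The function name and the token IDs are hypothetical, and real runtimes verify drafts in a batched forward pass and may use probabilistic acceptance instead.

```python
def accept_drafts(draft_tokens, verified_tokens):
    """Return the draft tokens accepted under greedy verification.

    draft_tokens: tokens proposed by the MTP heads (offsets +1, +2, ...).
    verified_tokens: tokens the base model picks at the same positions.
    Acceptance stops at the first mismatch.
    """
    accepted = []
    for draft, verified in zip(draft_tokens, verified_tokens):
        if draft != verified:
            break
        accepted.append(draft)
    return accepted


if __name__ == "__main__":
    # Illustrative token IDs only: the first two drafts match, the third does not.
    print(accept_drafts([11, 42, 7], [11, 42, 9]))
```

The more drafts get accepted per step, the fewer full forward passes you need, which is where the speedup comes from. If the heads are missing, there is nothing to accept.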

The key thing: these heads are architecturally distinct from the regular lm_head. They live in their own module tree, often named something like model.mtp.layers.0, model.mtp.layers.1, and so on. They reference shared embeddings but have their own normalization, attention, and projection weights.

That naming convention is exactly what trips up the tooling.
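A quick way to see which module trees a checkpoint actually contains is to count tensor names by their leading components. This is a minimal sketch; the helper name and the sample tensor names are hypothetical, chosen only to mirror the naming described above.

```python
from collections import Counter


def prefix_counts(tensor_names, depth=2):
    """Count tensor names grouped by their first `depth` dot-separated parts."""
    return Counter(".".join(name.split(".")[:depth]) for name in tensor_names)


if __name__ == "__main__":
    # Hypothetical names for illustration.
    names = [
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.1.mlp.up_proj.weight",
        "model.mtp.layers.0.norm.weight",
        "lm_head.weight",
    ]
    for prefix, count in sorted(prefix_counts(names).items()):
        print(f"{prefix}: {count}")
```

Any prefix in that output that your converter does not explicitly handle is a candidate for being dropped.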

Most quantization toolchains weren't designed with MTP in mind. They walk the state dict and apply transformations based on regex matches against expected layer names. Anything that doesn't match is either:

When I dug into the llama.cpp conversion script for the project, the relevant logic was essentially this pattern:

KNOWN_PREFIXES = ("model.layers.", "model.embed_tokens.", "model.norm.", "lm_head.")

for name, tensor in state_dict.items():
    if not name.startswith(KNOWN_PREFIXES):
        logger.debug(f"skipping unknown tensor: {name}")
        continue
    write_quantized(name, tensor)

The logger.debug is the killer. Unless you run conversion with debug logging on, you never see the skip messages. The file converts "successfully" and you walk away thinking everything's fine.
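One way to make this failure loud is to collect the skipped names up front and refuse to continue if there are any. The sketch below reuses the simplified prefix pattern shown above. It is not actual llama.cpp code, and find_unmapped and check_or_raise are hypothetical helper names.

```python
KNOWN_PREFIXES = ("model.layers.", "model.embed_tokens.", "model.norm.", "lm_head.")


def find_unmapped(tensor_names, known_prefixes=KNOWN_PREFIXES):
    """Return tensor names that match none of the known prefixes."""
    return [name for name in tensor_names if not name.startswith(known_prefixes)]


def check_or_raise(tensor_names, known_prefixes=KNOWN_PREFIXES):
    """Raise instead of silently skipping tensors the converter does not know."""
    unmapped = find_unmapped(tensor_names, known_prefixes)
    if unmapped:
        raise ValueError(
            f"{len(unmapped)} unmapped tensor(s), for example: {unmapped[:3]}"
        )
```

Calling check_or_raise on the checkpoint's tensor names before conversion turns a silent drop into an error you cannot miss.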

GPTQ-style quantizers have a related but different failure mode. They calibrate against forward passes through the model, and if your calibration code only exercises the main lm_head path, the MTP heads never see calibration data. Even if the weights are preserved, the resulting quantized heads are essentially random.

Here's the workflow I now use whenever I touch a model with MTP heads.

Before any conversion, dump the full state dict and grep for MTP-related modules. This sets your baseline.

from safetensors import safe_open

mtp_tensors = []
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        if "mtp" in key.lower() or "multi_token" in key.lower():
            mtp_tensors.append((key, f.get_slice(key).get_shape()))

for name, shape in mtp_tensors:
    print(f"{name}: {shape}")

print(f"\nTotal MTP tensors: {len(mtp_tensors)}")

Save this output. You'll diff against it after every conversion step.
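The diff itself can be a few lines of set arithmetic. This sketch assumes you have the tensor names from before and after a conversion step as plain lists. The function name and the optional rename callable (for converters that rename tensors on the way through) are hypothetical.

```python
def missing_after_conversion(before, after, rename=lambda name: name):
    """Return names from `before` whose renamed form is absent from `after`."""
    after_set = set(after)
    return sorted(name for name in before if rename(name) not in after_set)


if __name__ == "__main__":
    # Hypothetical names for illustration.
    before = ["model.mtp.layers.0.norm.weight", "model.mtp.layers.1.norm.weight"]
    after = ["model.mtp.layers.0.norm.weight"]
    for name in missing_after_conversion(before, after):
        print(f"missing: {name}")
```

An empty result means every MTP tensor from the baseline made it through that step.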

For llama.cpp style converters, you need to extend the known prefix list and add a mapping rule for the MTP heads. The clean way is to subclass or monkey-patch rather than editing the upstream script directly:

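# The base class name may differ between llama.cpp versions; adjust the import to match your checkout.
from convert_hf_to_gguf import Model

MTP_PREFIX = "model.mtp."

class MTPAwareModel(Model):
    def map_tensor_name(self, name: str, *args, **kwargs) -> str:
        # Route MTP tensors to a dedicated namespace instead of the default mapping.
        if name.startswith(MTP_PREFIX):
            return name.replace(MTP_PREFIX, "mtp.", 1)
        # Pass extra arguments through so the override stays call-compatible.
        return super().map_tensor_name(name, *args, **kwargs)

    def modify_tensors(self, data, name, bid):
        # Emit MTP tensors explicitly so they are never skipped.
        if name.startswith(MTP_PREFIX):
            return [(self.map_tensor_name(name), data)]
        return super().modify_tensors(data, name, bid)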

The critical bit is overriding modify_tensors: the default implementation has the silent skip we saw earlier.

If you're using GPTQ-style quantization, your calibration loop needs to actually hit the MTP heads. The default model(input_ids) forward pass only routes through the main LM head. You need to force the MTP heads to see activations:

def calibration_forward(model, batch):
    outputs = model(**batch, output_hidden_states=True)

    hidden = outputs.hidden_states[-1]
    for i, head in enumerate(model.mtp.layers):
        shifted = hidden[:, :-(i + 1), :]
        _ = head(shifted)

    return outputs

Without this, your MTP heads quantize to garbage even though the file looks complete.

Re-run the inventory script against the converted file. The tensor count should match. If you went GGUF, you can also dump metadata:

./gguf-dump model-quantized.gguf | grep -i mtp

Then run a quick speculative decoding sanity check. If the MTP heads are intact and properly calibrated, you should see your tokens-per-second numbers match (or get very close to) the unquantized baseline's speedup ratio.
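The comparison boils down to two ratios. Here is a sketch of the arithmetic with hypothetical function names. The 10% tolerance is an arbitrary assumption, and the throughput figures in the usage example are placeholders, not measurements.

```python
def speedup(tps_with_mtp, tps_without_mtp):
    """Speculative decoding speedup: throughput with MTP drafts over without."""
    return tps_with_mtp / tps_without_mtp


def speedup_retained(baseline_speedup, quantized_speedup, tolerance=0.10):
    """True if the quantized model keeps the baseline speedup within tolerance."""
    return quantized_speedup >= baseline_speedup * (1 - tolerance)


if __name__ == "__main__":
    # Placeholder tokens-per-second values for illustration only.
    baseline = speedup(tps_with_mtp=90.0, tps_without_mtp=50.0)
    quantized = speedup(tps_with_mtp=100.0, tps_without_mtp=100.0)
    print("speedup retained:", speedup_retained(baseline, quantized))
```

Comparing ratios instead of raw tokens per second matters here, because quantization changes the absolute throughput on its own. A quantized speedup near 1.0 is the signature of heads that are missing or miscalibrated.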

A few habits that have saved me repeated pain:

MTP is one of those features where the failure mode is invisible until you measure the thing the feature was supposed to improve. Treat the conversion pipeline as untrusted by default, and you'll avoid burning an afternoon on it like we did.
