{"slug": "what-s-in-a-gguf-besides-the-weights-and-what-s-still-missing", "title": "What's in a GGUF, besides the weights - and what's still missing?", "summary": "The GGUF file format consolidates all necessary model components—including weights, chat templates, and special tokens—into a single file, offering a more ergonomic alternative to the scattered JSON files typical of safetensors repos or the layered OCI structure used by Ollama. However, the format still relies on external Jinja2 template interpreters to handle complex conversational features like reasoning blocks, tool calls, and multimedia messages, with performance varying across implementations such as llama.cpp's custom Jinja engine and minijinja. The absence of a standardized, high-performance template execution system within GGUF itself remains a notable gap for local LLM applications.", "body_md": "# What's in a GGUF, besides the weights - and what's still missing?\n\nGGUF is the file format that [llama.cpp](https://github.com/ggml-org/llama.cpp) uses for language models.\n\nThe *really neat* thing about GGUF is that it's just one file.\nCompare this to [a typical safetensors repo on huggingface](https://huggingface.co/Qwen/Qwen3.5-0.8B/tree/main), where there's a pile of necessary JSON files scattered around - or to [a typical ollama model](https://ollama.com/library/qwen3.5:0.8b), which is an OCI with layers json, go templates, etc inside.\n\nThe contents are roughly the same, but GGUF makes it more ergonomic by keeping all this *stuff* in a single file.\n\nBut what is this *stuff*, and does it cover everything needed?\n\n## Chat Templates\n\nConversational language models are trained on sequences that follow a specific format, that sort of look like a conversation.\n\nFor instance, Gemma4's format looks like this:\n\n```\n<|turn>user\nHi there!<turn|>\n<|turn>model\nHi there, how can I help you today?<turn|>\n```\n\n...and LFM2's format template looks like this:\n\n```\n<s>\n<|im_start|>user Hi 
there!<|im_end|>\n<|im_start|>assistant Hi there, how can I help you today?<|im_end|>\n```\n\n...and that's just a basic example. It gets significantly more complicated once we start adding fancy features, like how and when to format reasoning blocks, how to present tool descriptions, tool calls and their responses, as well as how to encode multimedia messages (images, audio, video, etc.).\n\nAll this is handled by a *chat template*, a script in the jinja2 templating language. See for instance the [chat template that ships with Gemma 4](https://huggingface.co/google/gemma-4-E4B-it/raw/main/chat_template.jinja). The default chat template is stored under the `tokenizer.chat_template` key in the GGUF metadata. A model *may* have multiple chat templates, e.g. one with tool calling support and one without. Most commonly, models ship with a single monolithic chat template that only bothers with the tool calling stuff when tools are specified, but you do need to look for tool-specific chat templates in some models.\n\nJinja2 is a programming language, no doubt about it - it has loops, conditionals, assignments, lists, dictionaries, etc. 
- so any conversational LLM application must ship a programming language interpreter capable of running programs like the ~250 line jinja script that gemma ships with, every time a new message is added.\n\nHuggingface transformers uses jinja2 (the classic python lib). llama.cpp's llama-server and llama-cli use [their own jinja implementation](https://github.com/ggml-org/llama.cpp/tree/85d482e6b6706648070f620797e54f1a6a0ff3d8/common/jinja) (not to be confused with the somewhat baffling [llama_chat_apply_template](https://github.com/ggml-org/llama.cpp/blob/85d482e6b6706648070f620797e54f1a6a0ff3d8/src/llama-chat.cpp#L240) exposed in the libllama API, which hardcodes a handful of chat formats directly in C++ - a charming relic from before the real jinja implementation landed). NobodyWho uses [minijinja](https://github.com/mitsuhiko/minijinja), which is a reimplementation of jinja by its original creator in pure rust (not to be confused with [minja](https://github.com/google/minja), a minimalist jinja library that was once used by llama.cpp).\n\nThere is [a sizeable performance difference](https://gitlab.com/AsbjornOlling/chat-template-benchmark) between those jinja implementations. But chat templating isn't exactly the performance bottleneck in a local LLM application, so it's not worth bickering about.\n\n## Special Tokens\n\nLanguage models will readily output the next token for any sequence of tokens you feed them, forever - so we need some kind of way to stop them.\n\nThe typical solution for this is some kind of end-of-sequence token. The idea is for the inference engine to stop generation whenever the model emits such a token.\n\nThis is an example of a special token. 
Special tokens are generally tokens that have a broader semantic meaning than the letters they tokenize to.\nThey're generally tokens that shouldn't be shown to the user, although they (usually) still have a textual representation, so they *can* be.\n\nFor example, a few tokens for Gemma4:\n\n| Token ID | Textual representation | Purpose |\n|---|---|---|\n| 1 | `<eos>` | End of sequence, model emits this to stop generation. |\n| 2 | `<bos>` | Beginning of sequence, is prepended to inputs. |\n| 46 | `<|tool_call>` | Marks beginning of a tool call. |\n| 47 | `<tool_call|>` | Marks end of a tool call. |\n| 105 | `<|turn>` | Beginning of a conversational turn. |\n| 106 | `<turn|>` | End of a conversational turn. |\n\n## Sampler Configuration\n\nLanguage models output a distribution of next-token-probabilities. Selecting a token from this distribution is called sampling.\n\nThe simplest approach is to randomly select from the weighted distribution.\n\nBut we can do more. It has been shown that you can get even better results by applying some transformations to the probability distribution before selecting a concrete token.\n\nWhen research labs ship a new model, they often include a specific recommended sampler configuration.\n\nI have all too often seen people go copy-paste these values from a markdown file somewhere, to get better responses from the model.\n\nTo save users that step, we started uploading a small collection of curated models to [our huggingface page](https://huggingface.co/NobodyWho/models), bundled with the recommended sampler settings in a format we came up with ourselves. It worked, but it meant every model needed a NobodyWho-side conversion to be useful.\n\nHappily, a [recent addition to the GGUF format](https://github.com/ggml-org/llama.cpp/pull/17120) lets the sampler chain be specified directly in the model file. 
That makes our custom format obsolete - which is exactly the outcome we wanted.\n\n## Sampler Chain Sequence\n\nI quite like [this web app](https://artefact2.github.io/llm-sampling/) for quickly getting a feel for what the different types of sampler steps do.\nIf you drag-and-drop the individual steps, you'll notice that the order of sampling steps can make a big difference for what the final distribution is like.\n\nIt's frustrating to me that most sampler config formats (including ollama images' json files and HF's `generation_config.json`) don't have any way of specifying the order of sampling steps.\nI'm quite happy that the GGUF standard for this includes the `general.sampling.sequence` field, which lets you specify the order.\n\nBut still, many GGUF models will omit this field and expect the implicit order of \"whatever llama.cpp does by default\". Fine. Implicit, but it works.\n\n## What's still missing?\n\nGood inference engines aim to provide a unified interface for different language models.\nThe *extra stuff* in GGUF metadata covers a lot of this, so parsing and using that stuff lets us avoid a lot of model-specific codepaths.\n\n### Still Missing: Tool calling formats\n\nOne thing that seemingly every inference engine has hardcoded paths for is parsing different tool call formats.\n\nFor instance, a Qwen3 tool call looks like this:\n\n```\n<tool_call>{\"name\": \"get_weather\", \"arguments\": {\"location\": \"Copenhagen\"}}</tool_call>\n```\n\na Qwen3.5 tool call looks like this:\n\n```\n<tool_call>\n<function=get_weather>\n<parameter=city>\nCopenhagen\n</parameter>\n</function>\n</tool_call>\n```\n\n...and a Gemma4 tool call looks like this:\n\n```\n<|tool_call>call:get_weather{city:<|\"|>Copenhagen<|\"|>}<tool_call|>\n```\n\nCurrently, a bunch of different inference engines rush to implement parsers whenever a new model is released.\n\nIt would be a fantastic addition to the GGUF standard if model files would include a grammar, which we could 
derive a parser from.\n\nIn NobodyWho, we go one extra (somewhat unique?) step wrt. tool calling, because we generate a unique constraining grammar for the specific tools passed.\nThis means that we can guarantee type-safety for the tool calls. This is *especially* useful for the smallest models (1B or less) which can sometimes mess up and e.g. pass a float when an integer is required.\n\nWhile specifying a grammar that we could derive a generic tool calling parser from would be useful, NobodyWho would still need to implement the functions to generate grammars for each specific tool passed.\n\nIt's an interesting problem to come up with a sort of meta-grammar format, which we could use to derive concrete grammars for specific tools, from which we could derive parsers.\n\n### Still Missing: Think tokens\n\nThis is definitely the easiest one to just add.\n\nThe [upstream huggingface repos](https://huggingface.co/google/gemma-4-E2B/blob/main/tokenizer_config.json#L31) have begun to include a `think_token` field.\nThis is super useful for separating the thinking section of a generated output, since it should generally either be stripped or rendered differently from the main output.\n\nFor some reason, [the downstream GGUF conversions](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/blob/main/gemma-4-E2B-it-Q4_0.gguf) typically don't include this one.\nThis makes GGUF-based inference engines incapable of separating the think streams from the main output, without having to write specific codepaths for specific model-families.\n\nAdding `think_token` to the standard GGUF conversion pipeline would just fix this. We should do that.\n\n### Still Missing: Projection Models\n\nMultimodal LLM interaction (i.e. 
letting the LLM natively see images and audio, rather than just text), requires an additional model for processing the non-text input, known as a \"projection model\".\n\nThe convention is to then pass in *two* GGUF files: one GGUF for the main language model, and a smaller model for processing images and audio.\n\nThis breaks the just-one-file ergonomics. It would be a great improvement if the single GGUF file could bundle the projection model weights and config inside the main file.\n\nThe projection model is often ~1GB in size - enough of an overhead that we definitely want to skip it when it's not used. But I think it's reasonable to provide two variants of the GGUF: one with projection weights, and one without. That could get us back to the situation of managing just one url to download, just one file to cache on disk, etc.\n\n### Still Missing: List of Supported Features\n\nModels just don't support the same stuff, and it's not easily detectable from the GGUF file what stuff is actually supported.\n\nSome models support image ingestion, some don't. The best way to handle this right now, is to assume support for images when a projection model is passed in.\n\nSome models natively support tool calling, some don't. The best way to handle this right now, is to do substring matching on the chat template, to see if it tries to render the list of tool json schemas. This is obviously hacky.\n\nSome models will emit thinking blocks, some won't. Since thinking tags are typically missing from GGUF metadata, I'm not sure if there is any good way to see if we expect thinking blocks from a model.\n\nI would love for the GGUF community to start adding feature flags to the model files, such that model-agnostic inference libraries like ours can more consistently provide error messages and warnings when a consuming program tries to e.g. 
do tool calling on a model that doesn't natively support tool calling.\n\n## Conclusion\n\nI love GGUF.\n\nI love it because it's just a single file, that covers all of the *stuff* needed to run a model *correctly* without having to add a bunch of model-specific codepaths.\n\nI also love GGUF because it's an open, extensible format, with a strong community around it.\n\nThis means that we can work together to strengthen the standard, and keep a great developer experience while being able to easily swap out models in an application, without having to re-write any code.\n\nThis post covers a bunch of stuff that's already great about the GGUF metadata, and a bunch of things that we'd like to improve. Keep an eye on our huggingface page and the llama.cpp issues board in the coming weeks, if you'd like to follow our work in this area.\n\n*This post was written entirely by a human. No words were made up by the machine.*\n\nPublished May 14, 2026", "url": "https://wpnews.pro/news/what-s-in-a-gguf-besides-the-weights-and-what-s-still-missing", "canonical_source": "https://nobodywho.ooo/posts/whats-in-a-gguf/", "published_at": "2026-05-14 00:00:00+00:00", "updated_at": "2026-06-04 13:17:56.094472+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-infrastructure", "ai-tools"], "entities": ["GGUF", "llama.cpp", "Hugging Face", "Ollama", "Gemma4", "LFM2"], "alternates": {"html": "https://wpnews.pro/news/what-s-in-a-gguf-besides-the-weights-and-what-s-still-missing", "markdown": "https://wpnews.pro/news/what-s-in-a-gguf-besides-the-weights-and-what-s-still-missing.md", "text": "https://wpnews.pro/news/what-s-in-a-gguf-besides-the-weights-and-what-s-still-missing.txt", "jsonld": "https://wpnews.pro/news/what-s-in-a-gguf-besides-the-weights-and-what-s-still-missing.jsonld"}}