GGUF & Modelfile: The Power User's Guide to Local LLMs

The article explains how power users can download GGUF (GPT-Generated Unified Format) model files directly from Hugging Face, quantize them (using Q4_K_M as the optimal balance of size and quality), and import them into Ollama using a Modelfile—a configuration file similar to a Dockerfile that allows customization of parameters like context length, temperature, and system prompts. It provides step-by-step instructions for creating custom models, including performance tuning, chat template formatting, and multi-GPU support, as well as troubleshooting tips for memory issues and exporting models back to GGUF format for use in other tools like llama.cpp.

GGUF & Modelfile: The Power User's Guide to Local LLMs Beyond ollama pull — download any model from Hugging Face, quantize it, customize it, and import it into Ollama. What's GGUF? GGUF GPT-Generated Unified Format is the standard file format for running LLMs locally. Think of it as the .mp3 of AI models: - Compressed — 70-85% smaller than the original float16 weights - Fast — optimized for CPU and GPU inference - Portable — one file contains the entire model - Metadata-rich — includes tokenizer, chat template, and model config Every ollama pull downloads a GGUF file under the hood. But the real power move is downloading GGUF files directly from Hugging Face and importing them yourself. Quantization Analogy Steal This Quantization is like JPEG compression for AI models. A RAW photo is 50MB. A JPEG of the same photo is 5MB — 90% smaller, but it still looks 95% as good. That's what Q4 K M quantization does to a model: 70% smaller, 96% of the intelligence. Step 1: Finding the Right GGUF File The Golden Rule Always look for Q4 K M — it's the sweet spot of size vs quality for almost every model. Where to Find GGUFs | Source | URL | Best For | |---|---|---| Official provider | huggingface.co/Qwen etc. | Trustworthy, but often only Q8/Q6 | Unsloth | huggingface.co/unsloth | Best selection of quants Q2-Q8 | Bartowski | huggingface.co/bartowski | Massive library, every quantization | MaziyarPanahi | huggingface.co/MaziyarPanahi | Merged models, niche architectures | The GGUF Filename Decoder Qwen2.5-14B-Q4 K M.gguf ├── Model name ├── Size └── Quantization | Quant Code | Compression | Quality | Use Case | |---|---|---|---| | Q8 0 | 50% | 99% | When you have VRAM to spare | | Q6 K | 60% | 98% | High-quality, reasonable size | Q4 K M | 70% | 96% | 🟢 Sweet spot — use this | | Q3 K M | 78% | 92% | When VRAM is tight | | Q2 K | 85% | 85% | Emergency only — quality noticeably drops | | IQ4 XS | 72% | 95% | Experimental import format | Step 2: Download & Import a GGUF Basic Import 1. Download Q4 K M of Qwen 2.5-14B wget https://huggingface.co/bartowski/Qwen2.5-14B-GGUF/resolve/main/Qwen2.5-14B-Q4 K M.gguf 2. Create a Modelfile cat Modelfile << 'EOF' FROM ./Qwen2.5-14B-Q4 K M.gguf EOF 3. Import into Ollama ollama create my-custom-model -f Modelfile 4. Run it ollama run my-custom-model Smart Import with Optimized Settings cat Modelfile << 'EOF' FROM ./DeepSeek-R1-14B-Q4 K M.gguf Performance tuning PARAMETER num ctx 32768 PARAMETER num gpu layers 999 PARAMETER num thread 8 PARAMETER numa true Generation PARAMETER temperature 0.7 PARAMETER top p 0.9 PARAMETER repeat penalty 1.1 Chat template CRITICAL — must match the model TEMPLATE """{{ if .System }}<|im start| system {{ .System }}<|im end| {{ end }}<|im start| user {{ .Prompt }}<|im end| <|im start| assistant """ System prompt SYSTEM """You are a helpful AI assistant.""" EOF ollama create my-r1-custom -f Modelfile ollama run my-r1-custom Step 3: Modelfile Reference A Modelfile is like a Dockerfile for LLMs . Every line is an instruction. Parameters Reference | Parameter | What It Does | Default | Recommended Range | |---|---|---|---| temperature | Creativity level | 0.8 | 0.2 code – 1.0 creative | top p | Nucleus sampling | 0.9 | 0.85 – 0.95 | top k | Top-K sampling | 40 | 20 – 100 | num ctx | Context window size | 2048 | 4096 – 65536 | num gpu | GPU layers | 0 auto | 999 use all VRAM | num thread | CPU threads | auto | 4 – 16 | repeat penalty | Penalize repetition | 1.1 | 1.0 – 1.2 | stop | Stop sequences | varies | < | INSTRUCTION vs SYSTEM vs TEMPLATE {% raw %} SYSTEM: Persistent system prompt like OpenAI's system message SYSTEM """You are a helpful assistant.""" TEMPLATE: How user messages are formatted TEMPLATE """User: {{ .Prompt }} Assistant: """ INSTRUCTION: Model-specific instruction format rarely needed INSTRUCTION """Follow the user's instructions carefully.""" Three Production Configs 1. Coding Assistant FROM qwen2.5:7b PARAMETER temperature 0.2 PARAMETER top p 0.85 PARAMETER num ctx 65536 PARAMETER repeat penalty 1.1 SYSTEM """You are an expert Python developer. Write clean, tested code.""" 2. Creative Writer FROM mistral PARAMETER temperature 1.0 PARAMETER top p 0.95 PARAMETER num ctx 16384 SYSTEM """You are a novelist. Be vivid and descriptive.""" 3. Customer Support FROM llama4 PARAMETER temperature 0.5 PARAMETER top p 0.9 PARAMETER num ctx 8192 SYSTEM """You are a helpful customer support agent. Be polite, concise, and solution-oriented. NEVER mention that you are an AI.""" Step 4: Advanced Techniques 4.1 Multi-GPU Setup FROM deepseek-r1:70b Distribute across 2 GPUs PARAMETER num gpu layers 999 PARAMETER main gpu 0 PARAMETER tensor split "0.5,0.5" 4.2 LoRA Adapters Experimental Some Ollama builds support LoRA adapters: FROM base-model ADAPTER ./my-finetune-lora.gguf PARAMETER temperature 0.7 4.3 Custom Stop Tokens DeepSeek-R1 and Qwen use different stop tokens: For Qwen TEMPLATE """<|im start| user {{ .Prompt }}<|im end| <|im start| assistant """ PARAMETER stop "<|im end| " PARAMETER stop "<|im start| " For DeepSeek TEMPLATE """User: {{ .Prompt }} Assistant: """ PARAMETER stop "User:" 4.4 Emergency: VRAM Too Low If you get "CUDA out of memory": Force CPU for some layers PARAMETER num gpu layers 24 Only put 24 layers on GPU PARAMETER num thread 8 Use 8 CPU threads for the rest Step 5: GGUF from Ollama Models Export You can also export a model from Ollama back to a GGUF file: Save a model as GGUF ollama pull qwen2.5:7b ollama export qwen2.5:7b ./my-export.gguf Now you can use it anywhere llama.cpp, text-generation-webui, etc. ./llama-cli -m ./my-export.gguf -p "Hello" This is useful for: - Moving models between machines without re-downloading - Using the same model with multiple inference engines - Sharing a specific quantization with teammates Performance Cheat Sheet By GPU | GPU | VRAM | Best GGUF Model | Expected Speed | |---|---|---|---| | RTX 3060 / 4060 | 12 GB | Qwen 2.5-14B Q4 K M | 30-40 tok/s | | RTX 4070 / 5070 | 12 GB | Qwen 2.5-14B Q4 K M | 35-50 tok/s | | RTX 4080 / 5080 | 16 GB | DeepSeek-R1-14B Q4 K M | 30-45 tok/s | | RTX 4090 / 5090 | 24 GB | DeepSeek-R1-32B Q4 K M | 18-25 tok/s | | Mac M2 Pro | 16 GB | Qwen 2.5-7B Q4 K M | 15-25 tok/s | | Mac M4 Max | 36 GB | Qwen 3.6-27B Q4 K M | 20-30 tok/s | CPU-Only Performance | Model | Quant | RAM | Speed | |---|---|---|---| | Qwen 2.5-1.5B | Q4 K M | 4 GB | 8-15 tok/s | | Qwen 2.5-7B | Q4 K M | 16 GB | 1-4 tok/s | | Qwen 2.5-7B | Q2 K | 8 GB | 2-6 tok/s | Common Pitfalls | Problem | Cause | Fix | |---|---|---| | "Model not found" after import | Modelfile path is wrong | Use absolute path: FROM /home/user/model.gguf | | Gibberish output | Wrong chat template | The TEMPLATE line must match the model's expected format | | Slow generation | Running on CPU | PARAMETER num gpu layers 999 | | CUDA out of memory | Quantization too large for VRAM | Try smaller quant Q3 K M instead of Q4 K M | | Import errors | Corrupt GGUF download | Re-download and verify checksum | | Temperature not working | Set in Modelfile but overridden in API | Use the same temp in both places | | Chinese text output | Wrong template or default system prompt | Add PARAMETER stop "< | The tl;dr - Download: {% raw %} wget <huggingface-url /Model-Q4 K M.gguf - Create Modelfile: FROM ./Model.gguf + your settings - Import: ollama create my-model -f Modelfile - Run: ollama run my-model - Profit: Free, private, local AI Part of the Local LLM Guide — the definitive resource for running AI on your own hardware.