{"slug": "gguf-modelfile-the-power-user-s-guide-to-local-llms", "title": "GGUF & Modelfile: The Power User's Guide to Local LLMs", "summary": "The article explains how power users can download GGUF (GPT-Generated Unified Format) model files directly from Hugging Face, quantize them (using Q4_K_M as the optimal balance of size and quality), and import them into Ollama using a Modelfile, a configuration file similar to a Dockerfile that allows customization of parameters like context length, temperature, and system prompts. It provides step-by-step instructions for creating custom models, including performance tuning, chat template formatting, and multi-GPU support, as well as troubleshooting tips for memory issues and exporting models back to GGUF format for use in other tools like llama.cpp.", "body_md": "# GGUF & Modelfile: The Power User's Guide to Local LLMs\n\nBeyond `ollama pull`: download any model from Hugging Face, quantize it, customize it, and import it into Ollama.\n\n## What's GGUF?\n\n**GGUF** (GPT-Generated Unified Format) is the standard file format for running LLMs locally. Think of it as the `.mp3` of AI models:\n\n- **Compressed:** 70-85% smaller than the original float16 weights\n- **Fast:** optimized for CPU and GPU inference\n- **Portable:** one file contains the entire model\n- **Metadata-rich:** includes tokenizer, chat template, and model config\n\nEvery `ollama pull` downloads a GGUF file under the hood. But the real power move is downloading GGUF files directly from Hugging Face and importing them yourself.\n\n### Quantization Analogy (Steal This)\n\nQuantization is like JPEG compression for AI models. A RAW photo is 50MB. A JPEG of the same photo is 5MB: 90% smaller, but it still looks 95% as good. 
That's what Q4_K_M quantization does to a model: roughly 70% smaller, with about 96% of the quality (both figures are this guide's rough estimates).\n\n## Step 1: Finding the Right GGUF File\n\n### The Golden Rule\n\n**Start with Q4_K_M.** It's the sweet spot of size vs quality for almost every model.\n\n### Where to Find GGUFs\n\n| Source | URL | Best For |\n|---|---|---|\n| Official provider | `huggingface.co/Qwen` etc. | Trustworthy, but often only Q8/Q6 |\n| Unsloth | `huggingface.co/unsloth` | Best selection of quants (Q2-Q8) |\n| Bartowski | `huggingface.co/bartowski` | Massive library, every quantization |\n| MaziyarPanahi | `huggingface.co/MaziyarPanahi` | Merged models, niche architectures |\n\n### The GGUF Filename Decoder\n\n```\nQwen2.5-14B-Q4_K_M.gguf\n│       │   └── Quantization (Q4_K_M)\n│       └── Size (14B parameters)\n└── Model name (Qwen2.5)\n```\n\n| Quant Code | Size Reduction vs FP16 | Approx. Quality Retained | Use Case |\n|---|---|---|---|\n| Q8_0 | 50% | 99% | When you have VRAM to spare |\n| Q6_K | 60% | 98% | High-quality, reasonable size |\n| Q4_K_M | 70% | 96% | 🟢 Sweet spot: use this |\n| Q3_K_M | 78% | 92% | When VRAM is tight |\n| Q2_K | 85% | 85% | Emergency only: quality noticeably drops |\n| IQ4_XS | 72% | 95% | Experimental import format |\n\n## Step 2: Download & Import a GGUF\n\n### Basic Import\n\n```\n# 1. Download Q4_K_M of Qwen 2.5-14B\nwget https://huggingface.co/bartowski/Qwen2.5-14B-GGUF/resolve/main/Qwen2.5-14B-Q4_K_M.gguf\n\n# 2. Create a Modelfile\ncat > Modelfile << 'EOF'\nFROM ./Qwen2.5-14B-Q4_K_M.gguf\nEOF\n\n# 3. Import into Ollama\nollama create my-custom-model -f Modelfile\n\n# 4. 
Run it\nollama run my-custom-model\n```\n\n### Smart Import (with Optimized Settings)\n\n```\ncat > Modelfile << 'EOF'\nFROM ./DeepSeek-R1-14B-Q4_K_M.gguf\n\n# Performance tuning\nPARAMETER num_ctx 32768\nPARAMETER num_gpu 999\nPARAMETER num_thread 8\nPARAMETER numa true\n\n# Generation\nPARAMETER temperature 0.7\nPARAMETER top_p 0.9\nPARAMETER repeat_penalty 1.1\n\n# Chat template (CRITICAL: must match the model!)\n# The ChatML layout below is the Qwen-style format; check the model card\n# and swap in the model's own template if it differs.\nTEMPLATE \"\"\"{{ if .System }}<|im_start|>system\n{{ .System }}<|im_end|>\n{{ end }}<|im_start|>user\n{{ .Prompt }}<|im_end|>\n<|im_start|>assistant\n\"\"\"\n\n# System prompt\nSYSTEM \"\"\"You are a helpful AI assistant.\"\"\"\nEOF\n\nollama create my-r1-custom -f Modelfile\nollama run my-r1-custom\n```\n\n## Step 3: Modelfile Reference\n\nA Modelfile is like a **Dockerfile for LLMs**. Every line is an instruction.\n\n### Parameters Reference\n\n| Parameter | What It Does | Default | Recommended Range |\n|---|---|---|---|\n| `temperature` | Creativity level | 0.8 | 0.2 (code) – 1.0 (creative) |\n| `top_p` | Nucleus sampling | 0.9 | 0.85 – 0.95 |\n| `top_k` | Top-K sampling | 40 | 20 – 100 |\n| `num_ctx` | Context window size (tokens) | 2048 | 4096 – 65536 |\n| `num_gpu` | Number of layers offloaded to the GPU | 0 (auto) | 999 (use all VRAM) |\n| `num_thread` | CPU threads | auto | 4 – 16 |\n| `repeat_penalty` | Penalize repetition | 1.1 | 1.0 – 1.2 |\n| `stop` | Stop sequences | varies | `< |\n\n### SYSTEM vs TEMPLATE\n\n```\n# SYSTEM: Persistent system prompt (like OpenAI's system message)\nSYSTEM \"\"\"You are a helpful assistant.\"\"\"\n\n# TEMPLATE: How the full prompt sent to the model is laid out\nTEMPLATE \"\"\"User: {{ .Prompt }}\nAssistant: \"\"\"\n\n# Ollama's Modelfile has no INSTRUCTION keyword; standing guidance such as\n# \"Follow the user's instructions carefully.\" belongs in SYSTEM.\n```\n\n### Three Production Configs\n\n**1. 
Coding Assistant**\n\n```\nFROM qwen2.5:7b\nPARAMETER temperature 0.2\nPARAMETER top_p 0.85\nPARAMETER num_ctx 65536\nPARAMETER repeat_penalty 1.1\nSYSTEM \"\"\"You are an expert Python developer. Write clean, tested code.\"\"\"\n```\n\n**2. Creative Writer**\n\n```\nFROM mistral\nPARAMETER temperature 1.0\nPARAMETER top_p 0.95\nPARAMETER num_ctx 16384\nSYSTEM \"\"\"You are a novelist. Be vivid and descriptive.\"\"\"\n```\n\n**3. Customer Support**\n\n```\nFROM llama4\nPARAMETER temperature 0.5\nPARAMETER top_p 0.9\nPARAMETER num_ctx 8192\nSYSTEM \"\"\"You are a helpful customer support agent.\nBe polite, concise, and solution-oriented.\nNEVER mention that you are an AI.\"\"\"\n```\n\n## Step 4: Advanced Techniques\n\n### 4.1 Multi-GPU Setup\n\n```\nFROM deepseek-r1:70b\n\n# Distribute across 2 GPUs\nPARAMETER num_gpu 999\nPARAMETER main_gpu 0\n# tensor_split is a llama.cpp setting; some Ollama versions may reject it\nPARAMETER tensor_split \"0.5,0.5\"\n```\n\n### 4.2 LoRA Adapters (Experimental)\n\nSome Ollama builds support LoRA adapters:\n\n```\nFROM base-model\nADAPTER ./my-finetune-lora.gguf\nPARAMETER temperature 0.7\n```\n\n### 4.3 Custom Stop Tokens\n\nDeepSeek-R1 and Qwen use different stop tokens:\n\n```\n# For Qwen\nTEMPLATE \"\"\"<|im_start|>user\n{{ .Prompt }}<|im_end|>\n<|im_start|>assistant\n\"\"\"\nPARAMETER stop \"<|im_end|>\"\nPARAMETER stop \"<|im_start|>\"\n\n# For DeepSeek\nTEMPLATE \"\"\"User: {{ .Prompt }}\nAssistant: \"\"\"\nPARAMETER stop \"User:\"\n```\n\n### 4.4 Emergency: VRAM Too Low\n\nIf you get \"CUDA out of memory\":\n\n```\n# Offload only 24 layers to the GPU; the rest run on the CPU\nPARAMETER num_gpu 24\n# Use 8 CPU threads for the CPU-side layers\nPARAMETER num_thread 8\n```\n\n## Step 5: GGUF from Ollama Models (Export)\n\nYou can also get a model back out of Ollama as a GGUF file. Ollama stores each model's weights as a GGUF blob on disk, and `ollama show --modelfile` prints that blob's path on its `FROM` line, so exporting is a file copy:\n\n```\n# Find the GGUF blob behind a pulled model and copy it out\nollama pull qwen2.5:7b\nBLOB=$(ollama show qwen2.5:7b --modelfile | awk '/^FROM/ {print $2; exit}')\ncp \"$BLOB\" ./my-export.gguf\n\n# Now you can use it anywhere (llama.cpp, text-generation-webui, etc.)\n./llama-cli -m ./my-export.gguf -p 
\"Hello\"\n```\n\nThis is useful for:\n\n- Moving models between machines without re-downloading\n- Using the same model with multiple inference engines\n- Sharing a specific quantization with teammates\n\n## Performance Cheat Sheet\n\n### By GPU\n\n| GPU | VRAM | Best GGUF Model | Expected Speed |\n|---|---|---|---|\n| RTX 3060 / 4060 | 12 GB | Qwen 2.5-14B (Q4_K_M) | 30-40 tok/s |\n| RTX 4070 / 5070 | 12 GB | Qwen 2.5-14B (Q4_K_M) | 35-50 tok/s |\n| RTX 4080 / 5080 | 16 GB | DeepSeek-R1-14B (Q4_K_M) | 30-45 tok/s |\n| RTX 4090 / 5090 | 24 GB | DeepSeek-R1-32B (Q4_K_M) | 18-25 tok/s |\n| Mac M2 Pro | 16 GB | Qwen 2.5-7B (Q4_K_M) | 15-25 tok/s |\n| Mac M4 Max | 36 GB | Qwen 3.6-27B (Q4_K_M) | 20-30 tok/s |\n\n### CPU-Only Performance\n\n| Model | Quant | RAM | Speed |\n|---|---|---|---|\n| Qwen 2.5-1.5B | Q4_K_M | 4 GB | 8-15 tok/s |\n| Qwen 2.5-7B | Q4_K_M | 16 GB | 1-4 tok/s |\n| Qwen 2.5-7B | Q2_K | 8 GB | 2-6 tok/s |\n\n## Common Pitfalls\n\n| Problem | Cause | Fix |\n|---|---|---|\n| \"Model not found\" after import | Modelfile path is wrong | Use absolute path: `FROM /home/user/model.gguf` |\n| Gibberish output | Wrong chat template | The TEMPLATE line must match the model's expected format |\n| Slow generation | Running on CPU | `PARAMETER num_gpu 999` |\n| CUDA out of memory | Model too large for VRAM at this quantization | Try smaller quant (Q3_K_M instead of Q4_K_M) |\n| Import errors | Corrupt GGUF download | Re-download and verify checksum |\n| Temperature not working | A temperature sent in the API request overrides the Modelfile value | Set it in the request, or use the same temp in both places |\n| Chinese text output | Wrong template or default system prompt | Add `PARAMETER stop \"< |\n\n## The tl;dr\n\n- **Download:** `wget <huggingface-url>/Model-Q4_K_M.gguf`\n- **Create Modelfile:** `FROM ./Model.gguf` + your settings\n- **Import:** `ollama create my-model -f Modelfile`\n- **Run:** `ollama run my-model`\n- **Profit:** Free, private, local AI\n\n*Part of the Local LLM Guide, the 
definitive resource for running AI on your own hardware.*", "url": "https://wpnews.pro/news/gguf-modelfile-the-power-user-s-guide-to-local-llms", "canonical_source": "https://dev.to/lingdas1/gguf-modelfile-the-power-users-guide-to-local-llms-1fbi", "published_at": "2026-05-23 18:48:58+00:00", "updated_at": "2026-05-23 19:04:07.944290+00:00", "lang": "en", "topics": ["large-language-models", "open-source", "developer-tools", "artificial-intelligence", "machine-learning"], "entities": ["GGUF", "Ollama", "Hugging Face", "Qwen2.5-14B", "DeepSeek-R1-14B", "bartowski"], "alternates": {"html": "https://wpnews.pro/news/gguf-modelfile-the-power-user-s-guide-to-local-llms", "markdown": "https://wpnews.pro/news/gguf-modelfile-the-power-user-s-guide-to-local-llms.md", "text": "https://wpnews.pro/news/gguf-modelfile-the-power-user-s-guide-to-local-llms.txt", "jsonld": "https://wpnews.pro/news/gguf-modelfile-the-power-user-s-guide-to-local-llms.jsonld"}}