{"slug": "run-gemma-4-12b-on-wsl2-with-llama-cpp", "title": "Run Gemma-4 12B on WSL2 with llama.cpp", "summary": "A developer has published a guide for running Google's Gemma-4 12B instruction-tuned model on Windows Subsystem for Linux 2 (WSL2) using the llama.cpp framework. The process involves installing build tools, the NVIDIA CUDA toolkit for GPU acceleration, and compiling llama.cpp with CUDA support before loading the model from Hugging Face. The setup achieves approximately 19.5 tokens per second for prompt processing and 11.8 tokens per second for generation on compatible hardware.", "body_md": "\n\n```\nsudo apt update && sudo apt upgrade -y\n```\n\nIf you don't use the `-hf` option, you don't need to install libssl-dev in this step.\n\n```\nsudo apt install build-essential cmake git libssl-dev -y\n```\n\nIf `nvidia-smi` shows one or more GPUs in your terminal, you will need to install the CUDA toolkit. This will take some time.\n\n```\nsudo apt install nvidia-cuda-toolkit -y\n```\n\nBuild llama-cli and llama-server. This step will also take some time.\n\nIf you don't plan to use the `-hf` option, you don't need `-DLLAMA_OPENSSL=ON`.\n\n```\n# with GPU (CUDA)\ngit clone https://github.com/ggerganov/llama.cpp\ncd llama.cpp\ncmake -B build -DGGML_CUDA=ON -DLLAMA_OPENSSL=ON\ncmake --build build --config Release\n\n# no GPU\ngit clone https://github.com/ggerganov/llama.cpp\ncd llama.cpp\ncmake -B build\ncmake --build build --config Release\n```\n\nRun `gemma-4-12b-it` with the CLI or the server.\n\n```\n./build/bin/llama-cli -hf unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL\n> hello\n\n[Start thinking]\nThe user said \"hello\".\nThe user is initiating a conversation.\nRespond politely and offer assistance.\n\n    *   \"Hello! How can I help you today?\"\n    *   \"Hi there! What's on your mind?\"\n    *   \"Hello! Is there anything I can assist you with?\"\n[End thinking]\n\nHello! 
How can I help you today?\n\n[ Prompt: 19.5 t/s | Generation: 11.8 t/s ]\n```\n\nOr run the web UI with `llama-server`:\n\n```\n./build/bin/llama-server -hf unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL --port 8080\n\n# Optional: download the GGUF file manually instead of relying on -hf\nmkdir -p models\nwget -O models/gemma-4-12b-it-UD-Q4_K_XL.gguf https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/resolve/main/gemma-4-12b-it-UD-Q4_K_XL.gguf\n```\n\n", "url": "https://wpnews.pro/news/run-gemma-4-12b-on-wsl2-with-llama-cpp", "canonical_source": "https://dev.to/0xkoji/run-gemma-4-12b-on-wsl2-with-llamacpp-1o2m", "published_at": "2026-06-06 03:22:37+00:00", "updated_at": "2026-06-06 03:42:16.864161+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-tools", "ai-infrastructure", "generative-ai"], "entities": ["Gemma-4", "llama.cpp", "WSL2", "NVIDIA", "CUDA", "unsloth", "Hugging Face", "GGUF"], "alternates": {"html": "https://wpnews.pro/news/run-gemma-4-12b-on-wsl2-with-llama-cpp", "markdown": "https://wpnews.pro/news/run-gemma-4-12b-on-wsl2-with-llama-cpp.md", "text": "https://wpnews.pro/news/run-gemma-4-12b-on-wsl2-with-llama-cpp.txt", "jsonld": "https://wpnews.pro/news/run-gemma-4-12b-on-wsl2-with-llama-cpp.jsonld"}}