Run Gemma-4 12B on WSL2 with llama.cpp

A developer has published a guide for running Google's Gemma-4 12B instruction-tuned model on Windows Subsystem for Linux 2 (WSL2) with llama.cpp. The process involves installing build tools, installing the NVIDIA CUDA toolkit for GPU acceleration, and compiling llama.cpp with CUDA support before loading the model from Hugging Face. In the guide's sample run, llama-cli reports 19.5 tokens per second for prompt processing and 11.8 tokens per second for generation.

    sudo apt update && sudo apt upgrade -y

If you don't use the -hf option, you don't need to install libssl-dev in this step.

    sudo apt install build-essential cmake git libssl-dev -y

If nvidia-smi shows one or more GPUs in your terminal, you will need to install the CUDA toolkit. This will take some time.

    sudo apt install nvidia-cuda-toolkit -y

Build llama-cli and llama-server. This step will also take some time. If you don't plan to use the -hf option, you don't need -DLLAMA_OPENSSL=ON.

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON -DLLAMA_OPENSSL=ON
    cmake --build build --config Release

Without a GPU:

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build
    cmake --build build --config Release

Run gemma-4-12b-it with the CLI or the server.

    ./build/bin/llama-cli -hf unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL

    hello

    Start thinking
    The user said "hello".
    The user is initiating a conversation.
    Respond politely and offer assistance.
    "Hello How can I help you today?"
    "Hi there What's on your mind?"
    "Hello Is there anything I can assist you with?"
    End thinking

    Hello How can I help you today?

    Prompt: 19.5 t/s | Generation: 11.8 t/s

Or run the web UI:

    ./build/bin/llama-server -hf unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL --port 8080

To download the model file manually:

    mkdir -p models
    wget -O models/gemma-4-12b-it-UD-Q4_K_XL.gguf https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/resolve/main/gemma-4-12b-it-UD-Q4_K_XL.gguf
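
The choices in the steps above (GPU or no GPU, -hf or a manually downloaded file) can be collected in a small dry-run script. This is only a sketch under a few assumptions: build_flags, model_args, and has_gpu are hypothetical helper names, not part of llama.cpp; the script assumes it is run from the llama.cpp directory; and it assumes a manually downloaded file is loaded with llama.cpp's -m option, which the guide itself does not show. It prints the commands rather than running them.

```shell
#!/bin/sh
# Dry-run sketch: prints the configure and run commands, builds nothing.
# has_gpu, build_flags and model_args are hypothetical helpers.

HF_REPO="unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL"
MODEL_FILE="models/gemma-4-12b-it-UD-Q4_K_XL.gguf"

# has_gpu: succeeds when nvidia-smi is installed and lists at least one GPU.
has_gpu() {
    command -v nvidia-smi >/dev/null 2>&1 &&
        nvidia-smi -L 2>/dev/null | grep -q '^GPU '
}

# build_flags GPU USE_HF: print the CMake options for the chosen setup.
# Both arguments are "yes" or "no".
build_flags() {
    flags=""
    if [ "$1" = "yes" ]; then flags="$flags -DGGML_CUDA=ON"; fi
    if [ "$2" = "yes" ]; then flags="$flags -DLLAMA_OPENSSL=ON"; fi
    printf '%s\n' "${flags# }"
}

# model_args FILE: use the local GGUF file if it exists, else fall back to -hf.
model_args() {
    if [ -f "$1" ]; then
        printf '%s %s\n' "-m" "$1"
    else
        printf '%s %s\n' "-hf" "$HF_REPO"
    fi
}

if has_gpu; then gpu=yes; else gpu=no; fi
if [ -f "$MODEL_FILE" ]; then use_hf=no; else use_hf=yes; fi

echo "cmake -B build $(build_flags "$gpu" "$use_hf")"
echo "./build/bin/llama-server $(model_args "$MODEL_FILE") --port 8080"
```

Printing the commands first makes it easy to confirm which variant applies to a given machine before starting the long CUDA build.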