{"slug": "run-gemma-4-e2b-it-with-llama-cpp-on-raspberry-pi4", "title": "Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4", "summary": "A developer successfully ran Google's Gemma-4 E2B-it large language model on a Raspberry Pi 4 using llama.cpp, achieving text generation speeds of 1.5 to 1.8 tokens per second. The project involved converting the model to GGUF format and compiling llama.cpp with Clang and ARM NEON optimizations for the ARM-based single-board computer.", "body_md": "I tested Gemma-4 E2B-it on a Raspberry Pi 4.\n\nHow to convert Gemma-4 E2B-it to GGUF\n\nModels\n\n[https://huggingface.co/baxin/gemma-4-E4B-it-E2B-it-Q4_K_M](https://huggingface.co/baxin/gemma-4-E4B-it-E2B-it-Q4_K_M)\n\nllama.cpp: LLM inference in C/C++\n\n```\ngit clone https://github.com/ggml-org/llama.cpp.git\ncd llama.cpp\ncmake -B build -DCMAKE_BUILD_TYPE=Release\n\ncmake --build build --config Release\n```\n\nThe command below was run from the `llama.cpp` folder, with `gemma-4-E2B-it-Q4_K_M.gguf` placed in the sibling `models` folder.\n\n`folder structure`\n\n```\nllama.cpp   models\n./build/bin/llama-cli   -m ../models/gemma-4-E2B-it-Q4_K_M.gguf   -t 4   -tb 4   -c 2048   -fa auto   --prio 3   -p \"hello\"\n\n▄▄ ▄▄\n██ ██\n██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄\n██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██\n██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀\n                                    ██    ██\n                                    ▀▀    ▀▀\n\nbuild      : b9425-0821c5fcf\nmodel      : gemma-4-E2B-it-Q4_K_M.gguf\nmodalities : text\n\navailable commands:\n  /exit or Ctrl+C     stop or exit\n  /regen              regenerate the last response\n  /clear              clear the chat history\n  /read <file>        add a text file\n  /glob <pattern>     add text files using globbing pattern\n\n> 
hello\n\n[Start thinking]\nThinking Process:\n\n1.  **Analyze the input:** The input is \"hello\".\n2.  **Determine the context/intent:** This is a standard social greeting.\n3.  **Formulate an appropriate response:** The response should be friendly, polite, and acknowledge the greeting. Standard responses include reciprocating the greeting and offering further interaction (e.g., asking how the user is or offering assistance).\n4.  **Refine the response:** Keep it open-ended and welcoming.\n\n*Self-Correction/Refinement:* A simple \"hello\" back is fine, but adding a follow-up makes the interaction more engaging.\n\n5.  **Final Output Generation.**\n[End thinking]\n\nHello! How can I help you today?\n\n[ Prompt: 1.3 t/s | Generation: 1.8 t/s ]\n```\n\n`clang`\n\n```\nsudo apt install -y clang\nrm -rf build\ncmake -B build -DCMAKE_BUILD_TYPE=Release \\\n  -DLLAMA_NATIVE=ON \\\n  -DLLAMA_ARM_NEON=ON\n\ncmake --build build --config Release -j\n./build/bin/llama-cli   -m ../models/gemma-4-E2B-it-Q4_K_M.gguf   -t 4   -tb 4   -c 2048   -fa auto   --prio 3   -p \"hello\"\n▄▄ ▄▄\n██ ██\n██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄\n██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██\n██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀\n                                    ██    ██\n                                    ▀▀    ▀▀\n\nbuild      : b9425-0821c5fcf\nmodel      : gemma-4-E2B-it-Q4_K_M.gguf\nmodalities : text\n\navailable commands:\n  /exit or Ctrl+C     stop or exit\n  /regen              regenerate the last response\n  /clear              clear the chat history\n  /read <file>        add a text file\n  /glob <pattern>     add text files using globbing pattern\n\n> hello\n\n[Start thinking]\nThinking Process:\n\n1.  **Analyze the input:** The input is \"hello\".\n2.  **Determine the context:** This is a simple, friendly greeting.\n3.  **Formulate the response goal:** The response should be equally friendly, polite, and open-ended (inviting further conversation).\n4.  
**Draft potential responses:**\n    *   \"Hello!\" (Too brief, but fine.)\n    *   \"Hi there.\" (Friendly.)\n    *   \"Hello! How can I help you today?\" (Polite, proactive.)\n    *   \"Hello! What can I do for you?\" (Direct, service-oriented.)\n5.  **Select the best response:** A standard friendly greeting followed by an invitation to continue the interaction is usually best.\n\n6.  **Final Output Generation.**\n[End thinking]\n\nHello! How can I help you today?\n\n[ Prompt: 2.4 t/s | Generation: 1.5 t/s ]\n```\n\nPrompt processing got faster (1.3 → 2.4 t/s) ↗️ but generation got slower (1.8 → 1.5 t/s) ↘️\n\nUnfortunately, at this speed it isn't practical for running an agent.\n\nI also tried running LiquidAI/LFM2.5-8B-A1B-GGUF.\n\nThe result was even slower: Prompt: 0.3 t/s | Generation: 0.5 t/s ↘️\n\nA Raspberry Pi 5 costs around $305, so if you want to run an LLM with fewer than 10B parameters, a mini PC with 16GB RAM in the $300–400 range seems like the better buy.", "url": "https://wpnews.pro/news/run-gemma-4-e2b-it-with-llama-cpp-on-raspberry-pi4", "canonical_source": "https://dev.to/0xkoji/run-gemma-4-e2b-it-with-llamacpp-on-raspberry-pi4-3a1m", "published_at": "2026-05-31 02:19:42+00:00", "updated_at": "2026-05-31 02:41:53.000153+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-tools", "ai-infrastructure"], "entities": ["Gemma-4", "llama.cpp", "Raspberry Pi", "Hugging Face", "baxin"], "alternates": {"html": "https://wpnews.pro/news/run-gemma-4-e2b-it-with-llama-cpp-on-raspberry-pi4", "markdown": "https://wpnews.pro/news/run-gemma-4-e2b-it-with-llama-cpp-on-raspberry-pi4.md", "text": "https://wpnews.pro/news/run-gemma-4-e2b-it-with-llama-cpp-on-raspberry-pi4.txt", "jsonld": "https://wpnews.pro/news/run-gemma-4-e2b-it-with-llama-cpp-on-raspberry-pi4.jsonld"}}