Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4

wpnews.pro


Source: dev.to · Topic: large-language-models · Published May 31, 2026 (02:19 UTC) · 3 min read

A developer successfully ran Google's Gemma-4 E2B-it large language model on a Raspberry Pi 4 using llama.cpp, achieving text generation speeds of 1.5 to 1.8 tokens per second. The project involved converting the model to GGUF format and compiling llama.cpp with Clang and ARM NEON optimizations for the ARM-based single-board computer.

I tested Gemma-4 E2B-it on a Raspberry Pi 4.

How to convert Gemma-4 E2B-it to GGUF

Model: https://huggingface.co/baxin/gemma-4-E4B-it-E2B-it-Q4_K_M

llama.cpp (LLM inference in C/C++), built from source:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release
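Since the clang rebuild later in this post is about ARM NEON, it may be worth confirming first that the Pi's CPU actually reports NEON to Linux. This is a minimal sketch that reads /proc/cpuinfo; the helper names `has_neon` and `report_cpu` are hypothetical, and it assumes the usual Linux convention of listing NEON as `asimd` on 64-bit ARM and `neon` on 32-bit ARM.

```shell
#!/usr/bin/env bash
# Sketch: check whether the CPU advertises ARM NEON (Advanced SIMD) to Linux.
# Assumption: /proc/cpuinfo lists it as "asimd" on 64-bit ARM and "neon" on 32-bit ARM.

# has_neon [cpuinfo_file] -> exit status 0 if NEON/ASIMD is listed, 1 otherwise.
has_neon() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  # Keep only the "Features" lines, then look for asimd or neon as whole words.
  grep -i '^features' "$cpuinfo" | grep -qwE 'asimd|neon'
}

# report_cpu [cpuinfo_file] -> prints the machine type and the NEON result.
report_cpu() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  echo "machine: $(uname -m)"
  if has_neon "$cpuinfo"; then
    echo "NEON/ASIMD: available"
  else
    echo "NEON/ASIMD: not reported"
  fi
}
```

On the Pi, running `report_cpu` with no argument prints the `uname -m` machine type and whether NEON/ASIMD appears in the Features line. The optional file argument exists only so the check can be tried against a saved copy of cpuinfo.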

The command below was run from the llama.cpp folder, with gemma-4-E2B-it-Q4_K_M.gguf placed in a models folder alongside it.

Folder structure:

llama.cpp/
models/
    gemma-4-E2B-it-Q4_K_M.gguf
./build/bin/llama-cli \
  -m ../models/gemma-4-E2B-it-Q4_K_M.gguf \
  -t 4 -tb 4 -c 2048 -fa auto --prio 3 \
  -p "hello"

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b9425-0821c5fcf
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern

> hello

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The input is "hello".
2.  **Determine the context/intent:** This is a standard social greeting.
3.  **Formulate an appropriate response:** The response should be friendly, polite, and acknowledge the greeting. Standard responses include reciprocating the greeting and offering further interaction (e.g., asking how the user is or offering assistance).
4.  **Refine the response:** Keep it open-ended and welcoming.

*Self-Correction/Refinement:* A simple "hello" back is fine, but adding a follow-up makes the interaction more engaging.

5.  **Final Output Generation.**
[End thinking]

Hello! How can I help you today?

[ Prompt: 1.3 t/s | Generation: 1.8 t/s ]
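For reference, here is the same llama-cli invocation as a small bash wrapper with each flag commented. The flag descriptions should match llama.cpp's `--help` output, but it is worth checking `llama-cli --help` for your build; `run_gemma`, `MODEL`, `LLAMA_CLI` and `DRY_RUN` are hypothetical names for this sketch, not llama.cpp settings.

```shell
#!/usr/bin/env bash
# Sketch: the llama-cli command from this post, with each flag commented.
# MODEL, LLAMA_CLI and DRY_RUN are hypothetical helper variables, not llama.cpp settings.

MODEL="${MODEL:-../models/gemma-4-E2B-it-Q4_K_M.gguf}"
LLAMA_CLI="${LLAMA_CLI:-./build/bin/llama-cli}"

# run_gemma PROMPT -> runs llama-cli, or only prints the command when DRY_RUN=1.
run_gemma() {
  local prompt="$1"
  local args=(
    -m "$MODEL"    # GGUF model file
    -t 4           # threads used during generation (the Pi 4 has 4 cores)
    -tb 4          # threads used during batch / prompt processing
    -c 2048        # context size in tokens
    -fa auto       # flash attention mode: on, off, or auto
    --prio 3       # process/thread priority (3 = realtime)
    -p "$prompt"   # initial prompt
  )
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Display only: arguments are joined with spaces, without shell quoting.
    echo "$LLAMA_CLI ${args[*]}"
  else
    "$LLAMA_CLI" "${args[@]}"
  fi
}
```

`DRY_RUN=1 run_gemma "hello"` prints the full command without running it, and `run_gemma "hello"` runs it from the llama.cpp folder. Keeping the arguments in a bash array means a prompt containing spaces is still passed to llama-cli as a single argument.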

Next, install clang and rebuild llama.cpp with the native and ARM NEON options:

sudo apt install -y clang
rm -rf build
cmake -B build -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_NATIVE=ON \
  -DLLAMA_ARM_NEON=ON

cmake --build build --config Release -j
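Installing clang does not by itself make CMake use it: unless CC/CXX or CMAKE_C_COMPILER/CMAKE_CXX_COMPILER are set, CMake normally picks the system default compiler (usually GCC on Raspberry Pi OS). Also, llama.cpp's current build documentation uses GGML_-prefixed options such as GGML_NATIVE, so it is worth reading the configure output to see whether LLAMA_NATIVE and LLAMA_ARM_NEON had any effect. A minimal sketch, assuming a CMake build directory named `build`, that reads the compiler CMake cached; `cached_cxx_compiler` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: show which C++ compiler CMake cached for a build directory.
# To build with clang explicitly, a fresh configure would look like:
#   rm -rf build
#   cmake -B build -DCMAKE_BUILD_TYPE=Release \
#     -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
#   cmake --build build --config Release -j

# cached_cxx_compiler [build_dir] -> prints the CMAKE_CXX_COMPILER cache value.
cached_cxx_compiler() {
  local build_dir="${1:-build}"
  local cache="$build_dir/CMakeCache.txt"
  if [ ! -f "$cache" ]; then
    echo "no CMakeCache.txt in $build_dir" >&2
    return 1
  fi
  # Cache entries have the form NAME:TYPE=VALUE; match the exact name only,
  # so entries such as CMAKE_CXX_COMPILER_AR are skipped.
  sed -n 's/^CMAKE_CXX_COMPILER:[A-Z]*=//p' "$cache"
}
```

`cached_cxx_compiler build` typically prints a path such as `/usr/bin/c++`. Because that is often a symlink managed by the system, `readlink -f "$(command -v "$(cached_cxx_compiler build)")"` shows which real compiler binary it points to.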
./build/bin/llama-cli \
  -m ../models/gemma-4-E2B-it-Q4_K_M.gguf \
  -t 4 -tb 4 -c 2048 -fa auto --prio 3 \
  -p "hello"

build      : b9425-0821c5fcf
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

> hello

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The input is "hello".
2.  **Determine the context:** This is a simple, friendly greeting.
3.  **Formulate the response goal:** The response should be equally friendly, polite, and open-ended (inviting further conversation).
4.  **Draft potential responses:**
    *   "Hello!" (Too brief, but fine.)
    *   "Hi there." (Friendly.)
    *   "Hello! How can I help you today?" (Polite, proactive.)
    *   "Hello! What can I do for you?" (Direct, service-oriented.)
5.  **Select the best response:** A standard friendly greeting followed by an invitation to continue the interaction is usually best.

6.  **Final Output Generation.**
[End thinking]

Hello! How can I help you today?

[ Prompt: 2.4 t/s | Generation: 1.5 t/s ]

Compared with the first build, prompt processing got faster (1.3 → 2.4 t/s) but generation got slower (1.8 → 1.5 t/s).

Unfortunately, at these speeds it doesn't work for an agent.
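To put the measured speeds in perspective, a rough time per request is prompt_tokens / prompt_speed + generated_tokens / generation_speed. Agent workloads usually send long prompts (system instructions, tool definitions, conversation history). This sketch assumes a hypothetical 1,000-token prompt and a 200-token reply; `estimate_seconds` is a hypothetical helper, and it ignores model-loading time.

```shell
#!/usr/bin/env bash
# Sketch: rough wall-clock seconds for one request, from measured throughput.
#   time = prompt_tokens / prompt_tps + generated_tokens / generation_tps
# The token counts passed in are illustrative assumptions, not measurements.

# estimate_seconds PROMPT_TOKENS PROMPT_TPS GEN_TOKENS GEN_TPS -> prints seconds (1 decimal).
estimate_seconds() {
  local prompt_tokens="$1" prompt_tps="$2" gen_tokens="$3" gen_tps="$4"
  awk -v pt="$prompt_tokens" -v ps="$prompt_tps" \
      -v gt="$gen_tokens" -v gs="$gen_tps" \
      'BEGIN {
         if (ps <= 0 || gs <= 0) exit 1   # speeds must be positive
         printf "%.1f\n", pt / ps + gt / gs
       }'
}
```

With the clang build's numbers, `estimate_seconds 1000 2.4 200 1.5` prints 550.0, a bit over 9 minutes for a single turn. With the LFM2.5 numbers reported below, `estimate_seconds 1000 0.3 200 0.5` prints 3733.3, roughly an hour.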

I also tried LiquidAI/LFM2.5-8B-A1B-GGUF.

It was even slower: Prompt: 0.3 t/s | Generation: 0.5 t/s.

Raspberry Pi 5 costs around $305, so if you want to run an LLM with fewer than 10B parameters, it seems better to buy a mini PC with 16GB RAM in the $300–400 range.


Original article on dev.to: dev.to/0xkoji/run-gemma-4-e2b-it-with-llamacpp-o…

