# Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi 4

> Source: <https://dev.to/0xkoji/run-gemma-4-e2b-it-with-llamacpp-on-raspberry-pi4-3a1m>
> Published: 2026-05-31 02:19:42+00:00

I tested Gemma-4 E2B-it on a Raspberry Pi 4.

How to convert Gemma-4 E2B-it to GGUF

Models:

[https://huggingface.co/baxin/gemma-4-E4B-it-E2B-it-Q4_K_M](https://huggingface.co/baxin/gemma-4-E4B-it-E2B-it-Q4_K_M)

Build llama.cpp (LLM inference in C/C++) from source:


```
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release
```

The command was run from the `llama.cpp` folder, and `gemma-4-E2B-it-Q4_K_M.gguf` is placed in the `models` folder.

`folder structure`

```
llama.cpp   models
```

```
./build/bin/llama-cli   -m ../models/gemma-4-E2B-it-Q4_K_M.gguf   -t 4   -tb 4   -c 2048   -fa auto   --prio 3   -p "hello"

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b9425-0821c5fcf
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern

> hello

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The input is "hello".
2.  **Determine the context/intent:** This is a standard social greeting.
3.  **Formulate an appropriate response:** The response should be friendly, polite, and acknowledge the greeting. Standard responses include reciprocating the greeting and offering further interaction (e.g., asking how the user is or offering assistance).
4.  **Refine the response:** Keep it open-ended and welcoming.

*Self-Correction/Refinement:* A simple "hello" back is fine, but adding a follow-up makes the interaction more engaging.

5.  **Final Output Generation.**
[End thinking]

Hello! How can I help you today?

[ Prompt: 1.3 t/s | Generation: 1.8 t/s ]
```
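To compare runs, it helps to pull the two figures out of the summary line that `llama-cli` prints at the end. A minimal sketch, assuming the line keeps the format shown above; `parse_tps` is a hypothetical helper name, not part of llama.cpp:

```shell
# Hypothetical helper: read llama-cli output on stdin and print
# "<prompt t/s> <generation t/s>" taken from the summary line, e.g.
# "[ Prompt: 1.3 t/s | Generation: 1.8 t/s ]".
parse_tps() {
  sed -n 's/.*Prompt: *\([0-9.]*\) *t\/s.*Generation: *\([0-9.]*\) *t\/s.*/\1 \2/p'
}

# Example with the summary line from the run above.
printf '%s\n' '[ Prompt: 1.3 t/s | Generation: 1.8 t/s ]' | parse_tps
# prints: 1.3 1.8
```

The same function can be fed a saved copy of the terminal output instead of a single line, since `sed -n ... p` ignores every line that does not match.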

`clang`

```
sudo apt install -y clang
rm -rf build
cmake -B build -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_NATIVE=ON \
  -DLLAMA_ARM_NEON=ON

cmake --build build --config Release -j
./build/bin/llama-cli   -m ../models/gemma-4-E2B-it-Q4_K_M.gguf   -t 4   -tb 4   -c 2048   -fa auto   --prio 3   -p "hello"
▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b9425-0821c5fcf
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern

> hello

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The input is "hello".
2.  **Determine the context:** This is a simple, friendly greeting.
3.  **Formulate the response goal:** The response should be equally friendly, polite, and open-ended (inviting further conversation).
4.  **Draft potential responses:**
    *   "Hello!" (Too brief, but fine.)
    *   "Hi there." (Friendly.)
    *   "Hello! How can I help you today?" (Polite, proactive.)
    *   "Hello! What can I do for you?" (Direct, service-oriented.)
5.  **Select the best response:** A standard friendly greeting followed by an invitation to continue the interaction is usually best.

6.  **Final Output Generation.**
[End thinking]

Hello! How can I help you today?

[ Prompt: 2.4 t/s | Generation: 1.5 t/s ]
```
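The commands above install clang but do not pass a compiler to CMake, so the build may still use the system default compiler (often GCC). A sketch of an explicit selection, under these assumptions: `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER` are the standard CMake variables, `GGML_NATIVE` is the native-optimization switch in recent llama.cpp checkouts (`LLAMA_NATIVE` being, as far as I know, an older spelling), and `configure_cmd` is a hypothetical helper that only prints the command. This variant was not benchmarked here.

```shell
# Hypothetical helper: print a CMake configure command that names the
# compiler pair explicitly instead of relying on the system default.
configure_cmd() {
  printf 'cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=%s -DCMAKE_CXX_COMPILER=%s -DGGML_NATIVE=ON\n' "$1" "$2"
}

# Print the command for clang; run the printed line from the llama.cpp folder
# after `rm -rf build`, then build with `cmake --build build --config Release`.
configure_cmd clang clang++
```

During configuration CMake prints a "The C compiler identification is ..." line, which is a quick way to check which compiler was actually picked up.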

Prompt speed went up ↗️ (1.3 → 2.4 t/s), but generation went down ↘️ (1.8 → 1.5 t/s).
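To put numbers on that change, here is a small sketch that computes the relative difference between the two runs from the figures reported in the outputs above. `pct_change` is a hypothetical helper, and a single "hello" prompt per build is a very small sample, so treat the percentages as rough.

```shell
# Hypothetical helper: percent change from a "before" value to an "after" value.
pct_change() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%+.1f\n", (after - before) / before * 100 }'
}

pct_change 1.3 2.4   # prompt t/s, first build vs. second build: +84.6
pct_change 1.8 1.5   # generation t/s, first build vs. second build: -16.7
```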

Unfortunately, at this speed it doesn't work for an agent.
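A rough way to see why: at the generation speeds measured above, even a short reply takes minutes. The sketch below estimates the wait for a reply of a given length. The 200-token reply size is an assumption for illustration only, and `reply_seconds` is a hypothetical helper. Note that the thinking block the model prints before its answer also counts toward the generated tokens.

```shell
# Hypothetical helper: seconds needed to generate <tokens> at <tokens per second>.
reply_seconds() {
  awk -v tokens="$1" -v tps="$2" 'BEGIN { printf "%.0f\n", tokens / tps }'
}

reply_seconds 200 1.8   # first build:  111 seconds
reply_seconds 200 1.5   # second build: 133 seconds
```

An agent usually chains several such replies per task, so the per-reply delay multiplies quickly.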

I also tried running LiquidAI/LFM2.5-8B-A1B-GGUF.

The result was Prompt: 0.3 t/s | Generation: 0.5 t/s ↘️

A Raspberry Pi 5 costs around $305, so if you want to run an LLM with fewer than 10B parameters, a mini PC with 16GB RAM in the $300–400 range seems like the better buy.
