gemma4-safe-agent: a tool-using research agent on Gemma 4 e2b

A tool-using research agent built for the Gemma 4 DEV Challenge. It runs locally on the Gemma 4 e2b model via Ollama in roughly 200 lines of Node.js. The agent accepts a question, picks between two tools, reads a Wikipedia page, and returns a structured JSON answer with sources, requiring only 2 GB of RAM and no API key. The author notes that although the small Gemma 4 e2b model performs well at tool selection, the final-answer step needs a "cast" (validate-and-retry) loop to produce clean JSON reliably.

Submission for the Gemma 4 DEV Challenge, Build track. Companion to my Write-track post on the five libs behind it.

What it is

A tool-using research agent that runs locally on Gemma 4 e2b via Ollama, in around 200 lines of Node.

You give it a question. It picks between two tools, reads a Wikipedia page, then returns a structured JSON answer with sources. No API key. No rate limit. Two GB of RAM and an Ollama instance is the whole stack.

```
ollama pull gemma4:e2b
git clone https://github.com/MukundaKatta/gemma4-safe-agent
cd gemma4-safe-agent && npm install
npm run demo -- "What is RLHF?"
```

```
{
  "final": "RLHF is a technique that uses human preferences as a reward signal to fine-tune language models.",
  "sources": ["https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback"],
  "steps": 2
}
```

Repo: https://github.com/MukundaKatta/gemma4-safe-agent

Why Gemma 4 e2b specifically

Gemma 4 ships in four sizes: e2b and e4b for edge and mobile, a 26B Mixture-of-Experts model, and a 31B dense model for servers. I picked e2b on purpose. Reasons:

- Runs anywhere. Two GB of RAM, no network, no key. The agent works on a CI runner, a Raspberry Pi, an old MacBook. The bigger sizes do not.
- Hardest reliability case. A 2B-class model makes more parse mistakes and more arg mistakes than a 26B. If the scaffolding holds at the 2B level, the bigger ones are a drop-in via `GEMMA_MODEL=gemma4:e4b`.
- Real product surface. Cheap, fast, local agents are where on-device AI is going. e2b is the right target for the kind of agent you'd actually ship in a desktop app, a mobile shell, or a browser extension.

The same agent runs against any of the four Gemma 4 variants with one env var change.
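The one-env-var swap could look roughly like the sketch below. This is not the repo's code: `resolveModel`, `buildChatRequest`, and `chatOnce` are hypothetical names, the variable is assumed to be spelled `GEMMA_MODEL`, and the URL is Ollama's default local `/api/chat` endpoint on port 11434. It relies on the global `fetch` in Node 18+.

```javascript
// Hypothetical sketch of env-driven model selection; names are illustrative only.
const DEFAULT_MODEL = 'gemma4:e2b';

// Pick the model tag from the environment, falling back to e2b.
function resolveModel(env = process.env) {
  const name = (env.GEMMA_MODEL || '').trim();
  return name || DEFAULT_MODEL;
}

// Build the JSON body for a non-streaming Ollama chat request.
function buildChatRequest(messages, env = process.env) {
  return { model: resolveModel(env), messages, stream: false };
}

// Send one chat turn to a local Ollama instance and return the reply text.
async function chatOnce(messages, { baseUrl = 'http://localhost:11434', env = process.env } = {}) {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildChatRequest(messages, env)),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = await res.json();
  return data.message.content; // non-streaming replies carry a single message object
}
```

Keeping the model name out of the loop itself is what makes the larger variants a drop-in: only the request body changes.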
How it works

The whole agent is a small loop:

```js
for (let step = 0; step < MAX_STEPS; step++) {
  // Trim the history to the context budget, keeping the system prompt and the last two messages.
  const fitted = fit(messages, { maxTokens: 4096, preserveSystem: true, preserveLastN: 2 });
  const raw = await ollamaChat(fitted.messages);
  const action = parseAction(raw);

  if (action.kind === 'tool') {
    // Run the requested tool and feed its result back as the next user message.
    const result = await TOOLS[action.tool].fn(action.args);
    messages.push({ role: 'assistant', content: raw });
    messages.push({ role: 'user', content: `tool result: ${result}` });
    continue;
  }

  // Not a tool call: enforce the final-answer JSON with validate-and-retry.
  return cast({ llm, validate, prompt: 'Restate as JSON: ...' });
}
```

The whole run is wrapped in an `agentguard.firewall` block. Each tool is wrapped with `agentvet.vet` and `agentsnap.traceTool`. That gives me:

- Context budget management so Gemma 4 e2b never blows its small window
- Network egress allowlist so a prompt injection cannot redirect the agent to fetch an attacker URL
- Tool-arg validation so a hallucinated `fetch_url({ url: 12345 })` never runs
- Trace snapshots so swapping models or tweaking prompts shows up as a CI diff, not a production surprise
- Final-answer JSON enforcement with a validate-and-retry loop, which is the load-bearing piece for getting clean JSON out of a 2B model

I wrote about the scaffolding in detail in the Write-track companion post: https://dev.to/mukundakatta/making-gemma-4-e2b-production-safe-with-five-tiny-libraries-59k4. Here the focus is the agent and the demo.

What you can run

The repo ships three entry points:

- `npm run demo -- "..."`: real run against your local Gemma 4 e2b
- `npm run demo:mock`: same agent, with `fetch_url` returning canned pages (no internet needed)
- `AGENT_MOCK=1 node examples/run-stub.js`: deterministic stub LLM in place of Gemma 4, so the whole pipeline runs in CI without any model at all

The third one is the one I use for snapshot regression tests. It proves the agent's tool-use behavior is stable even with an LLM swapped out.

What surprised me

Two things.

Gemma 4 e2b picks the right tool more often than I expected.
The model is small but the tool-selection task is well-bounded ("you have these two tools, here's the schema, return one JSON"). When the surrounding scaffolding catches arg mistakes and JSON glitches, the model's reasoning is the part that doesn't need help.

The final-answer step is where the model really needs the cast loop. Asking for "JSON only, no prose" still produced `Sure here you go: {...}` enough of the time that I would not trust the agent without agentcast wrapping that step. With it, the post-condition becomes a guarantee.

Try it

Repo: https://github.com/MukundaKatta/gemma4-safe-agent

MIT. Issues and PRs welcome. The five scaffolding libs are all on npm under @mukundakatta/ and are zero-dep, so you can pull them into your own Gemma 4 projects one at a time.

If you build something on top of this, drop me a link. Have fun with Gemma 4.
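For anyone curious what the validate-and-retry idea behind the final-answer step amounts to, here is a minimal sketch. It is not the agentcast API: `extractJson`, `validateAnswer`, and `castWithRetry` are hypothetical helpers, the `llm` argument is assumed to be any async function from prompt string to reply string, and the answer shape (`final` as a string, `sources` as an array of strings) is an assumption based on the demo output.

```javascript
// Hypothetical sketch of a validate-and-retry loop; not the agentcast implementation.

// Pull the outermost {...} span out of a reply that may include chatty prose.
function extractJson(text) {
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(text.slice(start, end + 1));
  } catch {
    return null; // the braces were there but the content was not valid JSON
  }
}

// Return null when the value matches the assumed answer shape, else a reason string.
function validateAnswer(value) {
  if (value === null || typeof value !== 'object') return 'not a JSON object';
  if (typeof value.final !== 'string' || value.final.length === 0) {
    return '"final" must be a non-empty string';
  }
  if (!Array.isArray(value.sources) || !value.sources.every((s) => typeof s === 'string')) {
    return '"sources" must be an array of strings';
  }
  return null;
}

// Ask the model, validate the reply, and re-ask with the rejection reason on failure.
async function castWithRetry(llm, prompt, maxAttempts = 3) {
  let currentPrompt = prompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await llm(currentPrompt);
    const parsed = extractJson(raw);
    const problem = validateAnswer(parsed);
    if (problem === null) return parsed;
    currentPrompt = `${prompt}\n\nYour previous reply was rejected: ${problem}. Reply with JSON only.`;
  }
  throw new Error(`no valid JSON after ${maxAttempts} attempts`);
}
```

In this sketch the caller either gets an object that passed validation or an exception, never a half-parsed string. That is the sense in which a retry loop turns "usually JSON" into a post-condition.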