What I shipped during I/O 2026 week: Gemma 4 on Ollama with a five-piece safety stack

The author ran Google's new Gemma 4 open models (2B and 4B parameters) on a laptop using Ollama during Google I/O 2026 week. To make the small 2B model work as a useful research agent, they built a five-piece safety stack addressing common failure modes: JSON repair, tool call validation, domain allowlisting, anchored chat truncation, and snapshot testing. The resulting agent is approximately 200 lines of code, demonstrating that with proper scaffolding, a 2B-parameter model can be a practical tool with no API key and no rate limit.

Drafted in anticipation of the Google I/O 2026 Writing Challenge. Will add the devchallenge and challenge-specific tags once the announcement is live on May 19.

I/O week is the one week of the year where my GitHub feed and my coffee maker are equally caffeinated. This year the announcement I cared about most was Gemma 4, the new family of open models from Google. By Friday I had gemma4:2b and gemma4:4b running on a laptop via Ollama, plus a small research agent loop and a handful of tiny libraries I had been meaning to ship anyway.

Here is what I shipped, why I shipped each piece, and what I learned about running a 2B-parameter open model as the brain of a real agent.

Pointing a 2B-parameter model at "answer this question, use tools when you need to, return JSON" goes badly without scaffolding. The model wraps JSON in markdown fences. It hallucinates tool args. It drops a required field. Each of those is a separate failure mode, and each one has a clean, small fix.

The agent I built is around 200 lines of code. The five libraries it depends on are around 200 lines each. Total surface area is small enough that I can hold the whole thing in my head and stick a debugger into any of it.

Concretely, on the question "What is RLHF?", the agent runs a numbered sequence of steps: among them, gemma4:2b returns a structured plan, and a fetch_url call reads the Wikipedia page. Steps 4, 5, and 6 are the load-bearing ones. Without them the 2B model is a toy. With them it is a useful tool that runs on a laptop with no API key and no rate limit.

Gemma 4 will return JSON wrapped in a markdown json fence, sometimes with a trailing comma, sometimes prefixed with "Sure, here you go:". I built a three-pass repair: strip fences, extract the largest balanced JSON object, remove trailing commas. Then validate against a schema. If validation still fails, hand the model a short hint and retry once.

The hint is the trick. Small models self-correct beautifully when you tell them precisely what was wrong.
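
The three repair passes described above can be sketched with nothing but the standard library. This is a minimal illustration, not the code of the library used in the post: the function names (strip_fences, largest_balanced_object, remove_trailing_commas, repair_json, retry_hint) are hypothetical, schema validation is left out, and the fence pass simply deletes fence markers wherever they occur.

```rust
/// Pass 1: delete markdown fence markers (with or without the json tag).
fn strip_fences(raw: &str) -> String {
    // Built at runtime so the marker never appears literally in this listing.
    let fence = "`".repeat(3);
    raw.replace(&format!("{}json", fence), "").replace(&fence, "")
}

/// Finds the index just past the brace that closes the object opening at
/// `start`, ignoring braces that appear inside JSON strings.
fn matching_brace(bytes: &[u8], start: usize) -> Option<usize> {
    let (mut depth, mut in_string, mut escaped) = (0usize, false, false);
    for (offset, &b) in bytes[start..].iter().enumerate() {
        if in_string {
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'{' => depth += 1,
            b'}' => {
                depth -= 1;
                if depth == 0 {
                    return Some(start + offset + 1);
                }
            }
            _ => {}
        }
    }
    None // the object never closes
}

/// Pass 2: return the longest balanced {...} span in the text, if any.
fn largest_balanced_object(text: &str) -> Option<&str> {
    let bytes = text.as_bytes();
    let mut best: Option<(usize, usize)> = None; // byte range [start, end)
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'{' {
            if let Some(end) = matching_brace(bytes, i) {
                if best.map_or(true, |(s, e)| end - i > e - s) {
                    best = Some((i, end));
                }
                i = end; // nested objects are shorter, so skip past them
                continue;
            }
        }
        i += 1;
    }
    best.map(|(s, e)| &text[s..e])
}

/// Pass 3: drop commas that directly precede a closing brace or bracket,
/// leaving commas inside strings untouched.
fn remove_trailing_commas(json: &str) -> String {
    let chars: Vec<char> = json.chars().collect();
    let mut out = String::with_capacity(json.len());
    let (mut in_string, mut escaped) = (false, false);
    for (i, &c) in chars.iter().enumerate() {
        if in_string {
            out.push(c);
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        if c == '"' {
            in_string = true;
        }
        if c == ',' {
            let next = chars[i + 1..].iter().copied().find(|ch| !ch.is_whitespace());
            if matches!(next, Some('}') | Some(']')) {
                continue; // trailing comma: skip it
            }
        }
        out.push(c);
    }
    out
}

/// Runs the three passes in order. None means no JSON object was found.
fn repair_json(raw: &str) -> Option<String> {
    let unfenced = strip_fences(raw);
    let object = largest_balanced_object(&unfenced)?;
    Some(remove_trailing_commas(object))
}

/// Builds the short retry hint: name the field and state the constraint.
fn retry_hint(field: &str, constraint: &str) -> String {
    format!("Field \"{}\" {}. Return only the corrected JSON object.", field, constraint)
}
```

Calling repair_json on a reply that opens with "Sure, here you go:", wraps the object in a json fence, and leaves a comma before a closing brace returns only the cleaned object. A reply with no object at all returns None, which is the natural point to send a retry_hint and try once more.
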
Small models do not self-correct on "invalid JSON, please try again." Give them the field name and the constraint.

When the 2B model picks a tool, it sometimes picks args that violate the schema you sent it. The fix is to validate every tool call before running it. If validation fails, do not run the tool. Feed the validation issues back to the model as the tool's response. The model gets exactly the structural complaint it needs to correct on the next turn.

This catches three classes of bugs: wrong types (a string where a number was wanted), missing required fields, and extra fields the schema does not permit. All of them happen with smaller models. None of them reach a tool if you validate first.

Once the model can pick URLs to fetch, you have handed it URL-picking power. That is not always what you want. A naive prompt-injection attack can convince a small model to fetch from an attacker-controlled domain.

The fix is the smallest piece in the stack: a declarative domain allowlist. Set it to the three or four hosts the agent legitimately needs. Block everything else with an actionable error. The model never gets to wander.

Gemma 4 advertises a 128k-token context, but the practical throughput window is much smaller. Bounded chat histories matter.

The fix is anchored truncation: always preserve the system message at the top and the trailing user turn at the bottom. Drop the middle when the total goes over budget. DropOldest is the right default. DropMiddle is a reasonable alternative if you want to keep both early grounding context and recent turns. Both keep the load-bearing pieces of the prompt.

You tweak a system prompt. The agent picks a different tool order. Sometimes that is fine. Sometimes it is a regression that breaks the deployed app. By Friday afternoon.

The fix is a snapshot test. Record one agent run end-to-end as a JSON trace. First test run writes the snapshot. Every subsequent run compares against the snapshot and fails with a unified diff if anything diverges.
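
A snapshot check along these lines fits in a few dozen lines of standard-library Rust. The sketch below is an illustration under stated assumptions rather than the library from the post: check_snapshot, SnapshotOutcome, and the refresh flag are hypothetical names, the trace is treated as plain text, and the report is a simple position-by-position line comparison instead of a true unified diff.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// What happened when a trace was checked against its snapshot file.
#[derive(Debug, PartialEq)]
enum SnapshotOutcome {
    /// No snapshot existed (or a refresh was requested), so one was written.
    Written,
    /// The trace is identical to the stored snapshot.
    Matched,
    /// The trace differs; the string lists the differing lines.
    Diverged(String),
}

/// Compares `actual` with the snapshot at `path`.
/// The first run writes the file; `refresh` overwrites it on purpose.
fn check_snapshot(path: &Path, actual: &str, refresh: bool) -> io::Result<SnapshotOutcome> {
    if refresh || !path.exists() {
        fs::write(path, actual)?;
        return Ok(SnapshotOutcome::Written);
    }
    let expected = fs::read_to_string(path)?;
    if expected == actual {
        Ok(SnapshotOutcome::Matched)
    } else {
        Ok(SnapshotOutcome::Diverged(line_diff(&expected, actual)))
    }
}

/// Simplified diff: compares lines at the same position and reports each
/// mismatch as a "-" (snapshot) and "+" (current run) pair with line numbers.
fn line_diff(expected: &str, actual: &str) -> String {
    let old: Vec<&str> = expected.lines().collect();
    let new: Vec<&str> = actual.lines().collect();
    let mut out = String::new();
    for i in 0..old.len().max(new.len()) {
        match (old.get(i), new.get(i)) {
            (Some(a), Some(b)) if a == b => {} // unchanged line
            (a, b) => {
                if let Some(a) = a {
                    out.push_str(&format!("-{}: {}\n", i + 1, a));
                }
                if let Some(b) = b {
                    out.push_str(&format!("+{}: {}\n", i + 1, b));
                }
            }
        }
    }
    out
}
```

In a test, the Diverged case would be turned into a failure that prints the diff, and passing refresh as true corresponds to deliberately accepting a new trace.
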
Refresh the snapshot when the change is intentional. Five lines per test.

The pitch for big closed models is that they hide all of these problems for you. The pitch for small open models is everything else: latency, cost, privacy, offline-ness, the ability to fine-tune. The five problems above are the price of admission for the latter.

The good news is that each problem is small and each fix is small. The combined scaffolding is around 1000 lines of code, MIT-licensed, distributed as separate libraries you can adopt one at a time. You can swap any one of them for your own implementation without the rest noticing.

    // Trim the history to the token budget, then ask the model for an action.
    let messages = build_messages(question);
    let fitted = Fitter::new(8_000).fit(messages, Strategy::DropOldest);
    let raw = call_gemma4_via_ollama(&fitted, &tap).await?;
    let action = action_caster.parse(&raw)?;

    if action.kind == "tool" {
        // Validate the args; on failure the error text is phrased for the model.
        let v = tool_validator(&action.tool)?;
        v.validate(&action.args)
            .map_err(|e| anyhow::anyhow!(e.for_llm()))?;

        // Egress allowlist before any fetch
        if let Some(url) = action.args.get("url") {
            allow.check(url.as_str().unwrap())?;
        }

        run_tool(&action).await
    } else {
        Ok(action.text)
    }

That is the whole thing. The 2B model on the other end. Five small libraries doing the boring work. The result is reliable enough for me to dogfood on local tasks without worrying about the model going off the rails.

As for what I learned: two things, mostly.

Open is having a moment, but only with scaffolding. Gemma 4 2B running locally is a real productivity tool once the safety net is in place. Without the safety net it is a demo that breaks the first time a user asks something weird. The community has been quietly building the safety net for a year; pick it up off the shelf.

Local-first lowers the threshold for trying things. I built and tested the loop above without a single API call to a paid endpoint. The whole iteration cycle was free. The thing that would have been three weeks of work on a paid model was three nights of work on Ollama.

If you build something on Gemma 4 this week, the meta-lesson is: do not be afraid to scaffold around it. The model is the easy part. The scaffolding is the part that ships. Happy I/O week.
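
For anyone who wants a starting point, the anchored truncation described earlier is probably the easiest piece to sketch. The version below is a guess at the shape, not the Fitter library used in the loop: Message, estimate_tokens, and fit_drop_oldest are hypothetical names, tokens are approximated by counting whitespace-separated words, and the first and last messages are assumed to be the system prompt and the latest user turn.

```rust
use std::collections::VecDeque;

/// One chat turn. The role is a plain label such as "system" or "user".
#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: &'static str,
    content: String,
}

/// Rough token estimate (assumption): one token per whitespace-separated word.
/// A real tokenizer would give different numbers.
fn estimate_tokens(msg: &Message) -> usize {
    msg.content.split_whitespace().count()
}

/// Anchored truncation with a drop-oldest policy: the first message and the
/// last message are always kept, and the oldest messages in between are
/// dropped until the estimated total fits the budget.
fn fit_drop_oldest(messages: Vec<Message>, budget: usize) -> Vec<Message> {
    if messages.len() <= 2 {
        return messages; // nothing between the anchors to drop
    }
    let mut middle: VecDeque<Message> = messages.into();
    let head = middle.pop_front().unwrap(); // system message anchor
    let tail = middle.pop_back().unwrap(); // trailing user turn anchor

    let mut total = estimate_tokens(&head)
        + estimate_tokens(&tail)
        + middle.iter().map(estimate_tokens).sum::<usize>();

    while total > budget {
        match middle.pop_front() {
            Some(dropped) => total -= estimate_tokens(&dropped),
            None => break, // only the anchors remain; keep them even if over budget
        }
    }

    let mut fitted = vec![head];
    fitted.extend(middle);
    fitted.push(tail);
    fitted
}
```

A drop-middle variant would remove messages from the center of the deque instead of the front, which keeps some early grounding context alongside the most recent turns.
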