Easy Agentic Tool Calling with Gemma 4

Google's Gemma 4 model can now decide on its own when to inspect its local filesystem and when to execute Python code, moving beyond simple web API calls toward agentic behavior. The implementation gives the model a sandboxed filesystem explorer and a restricted Python interpreter, so it can reason about its own environment and offload computational tasks. This marks a shift from retrieval-augmented chatbots to models that interact with the system they run on.

In this tutorial, we will give Gemma 4 two new tools and watch the model decide, on its own, when to look around and when to compute.

Introduction

In a recent article (https://machinelearningmastery.com/how-to-implement-tool-calling-with-gemma-4-and-python/) on Machine Learning Mastery, we built a tool-calling agent that reached outward, pulling weather, news, currency rates, and time from public APIs. That article covered the synthesis half of the pattern nicely, but it left the more interesting half on the table: an agent that reasons about its own environment, inspects its own machine, and offloads logic it doesn't trust itself to perform. It could be argued that this is closer to truly "agentic."

This article picks up where that one left off. We will give Gemma 4 two new tools, a sandboxed local filesystem explorer and a restricted Python interpreter, and watch the model decide, on its own, when to look around and when to compute.

Topics we will cover include:

- Why "agentic" tool calling needs more than web APIs to be interesting
- How to build a filesystem inspection tool with hard path-traversal guards
- How to wire a Python interpreter tool to the model without handing it the keys to your machine
- How the same orchestration loop from before generalizes to these new capabilities

I highly recommend that you first read this article (https://machinelearningmastery.com/how-to-implement-tool-calling-with-gemma-4-and-python/) before continuing.

From Conversation to Agency

When the only tools you give a language model are read-only web APIs, you essentially still have a chatbot, albeit one with access to better information. The model receives a prompt, decides which API to ping, and stitches the JSON response into a paragraph.
There is no real notion of environment, no state to inspect, no consequence to reason about; it's a scenario more akin to retrieval augmented generation (https://www.kdnuggets.com/7-steps-to-mastering-retrieval-augmented-generation) than true agency.

Agency, in the practical sense practitioners use the word, shows up when a model starts interacting with the system it is running on. That can mean reading from a local filesystem, executing code, modifying files, calling other processes, or any combination of those. The moment a tool can do something other than return a clean string from a remote service, the model has to start asking about its own environment: what files exist, what does this number actually equal, what is in this folder before I claim it contains anything.

The Gemma 4 family, and specifically the gemma4:e2b edge variant we have been using, is small enough to run locally on a laptop while being competent enough at structured output to drive this kind of loop reliably. That combination is what makes the local-agentic pattern interesting in the first place.

The complete code for this tutorial can be found here (https://github.com/mmmayo13/gemma_4_tool_calling/blob/main/cmd_calling.py).

The Architectural Reuse

The orchestration loop from the previous tutorial does not change. We define Python functions, expose them via JSON schema, pass the registry to Ollama alongside the user prompt, intercept any tool_calls block on the response, execute the requested function locally, append the result as a tool-role message, and re-query the model so it can synthesize a final answer. The same call_ollama helper, the same TOOL_FUNCTIONS dictionary, and the same available_tools schema array from the previous tutorial all make appearances.

What changes is the nature of the tools themselves. Where the previous batch were all thin clients over remote APIs, the two we will build now both run code on the machine.
That shifts the design problem from "how do I parse this response" to "how do I make sure the model cannot, even accidentally, do something it should not be allowed to do."

Tool 1: A Sandboxed Filesystem Explorer

The first tool, list_directory_contents, gives the model the ability to see what files exist in a given folder. This sounds trivial until you remember that os.listdir accepts any string, including /, ~, and ../../etc. A naive implementation could happily walk the model's "curiosity" straight to your API keys.

The design choice here is to pin a safe base directory at script start and reject any request that resolves outside of it:

    # Security: confine list_directory_contents to this base directory and its descendants
    # Set to the current working directory when the script starts
    SAFE_BASE_DIR = os.path.abspath(os.getcwd())

    def list_directory_contents(path: str = ".") -> str:
        """Lists files and directories within a path, constrained to the safe base directory."""
        try:
            # Resolve to an absolute path and verify it sits inside SAFE_BASE_DIR
            # This blocks traversal attempts like '../../etc' or absolute paths like '/'
            requested = os.path.abspath(os.path.join(SAFE_BASE_DIR, path))
            if not (requested == SAFE_BASE_DIR or requested.startswith(SAFE_BASE_DIR + os.sep)):
                return (
                    f"Error: Access denied. The path '{path}' resolves outside the "
                    f"permitted workspace ({SAFE_BASE_DIR})."
                )
            ...

The pattern is simple but worth considering further. We never trust the string the model produced. We join it onto the base directory, resolve it absolutely (so .. gets normalized away), and then verify the resolved path still starts with the base. Both /etc/passwd and ../../somewhere collapse into paths that fail that prefix check and are rejected before os.listdir is ever called.

The rest of the function is housekeeping: confirm the path exists and is a directory, list its contents, and format each entry as either DIR or FILE with a byte size.
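The traversal guard can also be exercised on its own. The following is a minimal sketch, not the article's script: the helper name resolves_inside and the base directory /srv/workspace are hypothetical (the directory does not need to exist), while the check itself mirrors the join, normalize, and prefix-test steps described above.

```python
import os


def resolves_inside(base_dir: str, user_path: str) -> bool:
    """Return True if user_path, joined onto base_dir, stays inside base_dir."""
    base = os.path.abspath(base_dir)
    # Joining first means an absolute user_path replaces the base entirely,
    # and abspath then normalizes any '..' segments away.
    requested = os.path.abspath(os.path.join(base, user_path))
    # Appending os.sep keeps sibling names such as 'workspace_evil' from
    # passing a bare prefix comparison against 'workspace'.
    return requested == base or requested.startswith(base + os.sep)


# Hypothetical workspace root used only for illustration.
BASE = os.path.abspath(os.path.join(os.sep, "srv", "workspace"))

candidates = [".", "data", "src/utils", "../../etc", "/etc/passwd", "../workspace_evil"]
for candidate in candidates:
    print(f"{candidate!r:22} -> {resolves_inside(BASE, candidate)}")
```

The ../workspace_evil case shows why the comparison appends os.sep: without it, a sibling directory whose name merely begins with the workspace name would pass. Note also that os.path.abspath normalizes .. but does not resolve symbolic links, so a stricter variant might use os.path.realpath instead; the excerpt shown here uses abspath.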
The returned string is plain English with structure the model can parse on the second pass:

            entries = sorted(os.listdir(requested))
            if not entries:
                return f"The directory '{path}' is empty."

            lines = [f"Contents of '{path}' ({len(entries)} item(s)):"]
            for name in entries:
                full = os.path.join(requested, name)
                if os.path.isdir(full):
                    lines.append(f"  DIR  {name}/")
                else:
                    try:
                        size = os.path.getsize(full)
                        lines.append(f"  FILE {name} ({size} bytes)")
                    except OSError:
                        lines.append(f"  FILE {name}")
            return "\n".join(lines)

The JSON schema we hand to the model is deliberately permissive on the parameter side: path is optional, defaulting to the workspace root, because most useful first questions are about the current folder:

    {
        "type": "function",
        "function": {
            "name": "list_directory_contents",
            "description": (
                "Lists files and subdirectories inside a path within the user's workspace. "
                "Use this to inspect the environment before answering questions about local files."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "A relative path inside the workspace, e.g. '.', 'data', or 'src/utils'. "
                                       "Defaults to the workspace root."
                    }
                },
                "required": []
            }
        }
    }

Note the description does a small amount of prompt engineering: "Use this to inspect the environment before answering questions about local files." That sentence pushes Gemma 4 toward calling the tool when the user asks a vague question about "my files" rather than guessing at what might be there.

Tool 2: A Restricted Python Interpreter

The second tool, execute_python_code, is the more dangerous and the more pedagogically interesting of the two. The premise is that language models, especially small ones, are unreliable at precise arithmetic, exact string manipulation, and anything involving more than a couple of steps of branching logic. A tool that lets the model write and run a deterministic snippet is a much better answer to those problems than asking it to reason through them in natural language.
The implementation uses exec with a deliberately stripped-down builtins namespace:

    def execute_python_code(code: str) -> str:
        """Executes a snippet of Python code and returns whatever was printed to stdout.

        This is a learning-only sandbox. exec() is fundamentally unsafe; do not expose
        this tool to untrusted users or networks. The restrictions below stop the
        casual cases, not a determined attacker.
        """
        try:
            # A minimal restricted environment. We strip builtins down to a small
            # whitelist so that, e.g., open(), eval(), and __import__ are not directly
            # available from the snippet's global scope.
            safe_builtins = {
                "abs": abs, "all": all, "any": any, "bool": bool, "dict": dict,
                "divmod": divmod, "enumerate": enumerate, "filter": filter,
                "float": float, "int": int, "len": len, "list": list, "map": map,
                "max": max, "min": min, "pow": pow, "print": print, "range": range,
                "repr": repr, "reversed": reversed, "round": round, "set": set,
                "sorted": sorted, "str": str, "sum": sum, "tuple": tuple, "zip": zip,
            }
            # Pre-import a couple of safe, useful modules so the model doesn't have to.
            import math, statistics
            restricted_globals = {
                "__builtins__": safe_builtins,
                "math": math,
                "statistics": statistics,
            }

A few decisions are worth calling out. We replace builtins entirely rather than blacklisting individual functions, which means open, eval, exec, compile, __import__, input, and anything else not in our whitelist simply does not exist inside the snippet. We pre-import math and statistics into the snippet's globals because the model will reach for them constantly and we would rather not force it to fight import restrictions. We capture stdout with contextlib.redirect_stdout so the model gets back exactly what its snippet printed:

            # Capture stdout so we can hand the printed output back to the model
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                exec(code, restricted_globals, {})
            output = buffer.getvalue().strip()
            if not output:
                return "Code executed successfully but produced no output. Use print() to return a value."
            return f"Output:\n{output}"

The empty-output branch matters more than it looks. Small models will routinely write expressions like x = sum(range(101)) and forget the print(x). Returning a specific message telling the model to use print() gives the orchestration loop the option to retry; without it, the model would synthesize a final answer based on an empty string and confidently invent a value.

A final word on safety, since the script's docstring is blunt about it: this is a learning sandbox, not a hardened one. A determined adversary can break out of a Python exec sandbox in a dozen ways, most of them involving object introspection through .__class__.__mro__. For a single-user agent running on your own laptop on your own prompts, the whitelist is plenty. For anything else, you would want a real isolation layer: a subprocess with seccomp, a container, or RestrictedPython.

The Orchestration Loop

The main loop is unchanged in structure from the previous tutorial. The model is queried with the user prompt and the tool registry, and if it responds with tool_calls, each call is dispatched against TOOL_FUNCTIONS:

    if "tool_calls" in message and message["tool_calls"]:
        print("TOOL EXECUTION")
        messages.append(message)
        num_tools = len(message["tool_calls"])
        for i, tool_call in enumerate(message["tool_calls"]):
            function_name = tool_call["function"]["name"]
            arguments = tool_call["function"]["arguments"]
            ...
            if function_name in TOOL_FUNCTIONS:
                func = TOOL_FUNCTIONS[function_name]
                try:
                    result = func(**arguments)
                    ...
                    messages.append({
                        "role": "tool",
                        "content": str(result),
                        "name": function_name
                    })

The CLI formatting is worth a small tweak for this script. The execute_python_code tool's code argument can be a multi-line string with newlines in it, which will wreck an ASCII tree if printed naively.
We flatten and truncate string arguments for the display only; the model still receives the full string when the function runs:

    def _short(v):
        if isinstance(v, str):
            flat = v.replace("\n", "\\n")
            if len(flat) > 60:
                flat = flat[:57] + "..."
            return f"'{flat}'"
        return str(v)

    args_str = ", ".join(f"{k}={_short(v)}" for k, v in arguments.items())

Once each tool result is appended back into the message history as a "role": "tool" entry, we re-call Ollama with the enriched payload and the model produces its grounded final answer. Same two-pass pattern, same logic.

Testing the Tools

And now we test our tool calling. Pull gemma4:e2b with ollama pull gemma4:e2b if you have not already, then run the script from a folder you do not mind the model peeking at.

Let's start with the filesystem tool. From the project directory:

    What scripts are in my current folder, and which one looks like it should be used to process CSVs?

Result:

    SYSTEM
      ○ Tool: execute_python_code...................... LOADED
      ○ Tool: list_directory_contents.................. LOADED
      ○ Workspace: /Users/matt/projects/gemma_agent..... SANDBOXED

    PROMPT
      What scripts are in my current folder, and which one looks like it should be used to process CSVs?

    EXECUTION
      ● Querying model...

    TOOL EXECUTION
      └── Calling: list_directory_contents
          ├─ Args: path='.'
          └─ Result: Contents of '.' (5 item(s)):
               FILE README.md (412 bytes)
               FILE csv_cleaner.py (1834 bytes)
               FILE main.py (10786 bytes)
               FILE notes.txt (88 bytes)
               FILE sales_report.py (2210 bytes)

    EXECUTION
      ● Synthesizing results...

    RESPONSE
      Your current folder contains five files. The one that looks intended for CSV processing is csv_cleaner.py: its name strongly suggests it handles CSV input. sales_report.py may also touch CSV data, but its name is more about output than ingestion.

The model called the tool, looked at the actual filenames, and made a reasonable inference grounded in the listing rather than in its weights. That is the difference between hallucination and observation.
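Before turning to the interpreter test, the restricted-namespace idea behind execute_python_code can be tried in isolation. The following is a minimal sketch rather than the article's script: run_snippet and the shortened SAFE_BUILTINS whitelist are hypothetical names, and wrapping failures into an "Error: ..." string is an assumption about how errors might be reported back to the model.

```python
import contextlib
import io
import math
import statistics

# Hypothetical, trimmed-down whitelist; the article's version is longer.
SAFE_BUILTINS = {"len": len, "print": print, "range": range, "round": round, "sum": sum}


def run_snippet(code: str) -> str:
    """Run code with whitelisted builtins only and return what it printed."""
    restricted_globals = {
        "__builtins__": SAFE_BUILTINS,  # replaces the default builtins entirely
        "math": math,
        "statistics": statistics,
    }
    buffer = io.StringIO()
    try:
        # Anything the snippet prints lands in the buffer instead of the terminal.
        with contextlib.redirect_stdout(buffer):
            exec(code, restricted_globals, {})
    except Exception as exc:
        # Assumed error format: exception type plus message, as a plain string.
        return f"Error: {type(exc).__name__}: {exc}"
    output = buffer.getvalue().strip()
    if not output:
        return "Code executed successfully but produced no output. Use print() to return a value."
    return f"Output:\n{output}"


print(run_snippet("print(sum(range(101)))"))             # prints a value
print(run_snippet("x = sum(range(101))"))                # forgets to print
print(run_snippet("print(open('/etc/passwd').read())"))  # open is not whitelisted
print(run_snippet("import os"))                          # import machinery is missing
```

The first call returns the printed sum, the second hits the empty-output branch, and the last two fail inside the snippet because open and the import hook are absent from the replaced builtins. As the article stresses, this only stops casual cases and is not a hardened sandbox.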
Next, the Python interpreter. A small task that small models reliably get wrong if asked to do it in their head:

    What is the standard deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to four decimal places?

Result:

    SYSTEM
      ○ Tool: execute_python_code...................... LOADED
      ○ Tool: list_directory_contents.................. LOADED
      ○ Workspace: /Users/matt/projects/gemma_agent..... SANDBOXED

    PROMPT
      What is the standard deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to four decimal places?

    EXECUTION
      ● Querying model...

    TOOL EXECUTION
      └── Calling: execute_python_code
          ├─ Args: code='nums = [12, 18, 23, 24, 29, 31, 35, 41, 44, 47]\nprint(rou...'
          └─ Result: Output: 11.4717

    EXECUTION
      ● Synthesizing results...

    RESPONSE
      The standard deviation of those numbers, rounded to four decimal places, is 11.4717.

The model offloaded the calculation entirely; it wrote a snippet, called statistics.stdev (the sample standard deviation), rounded the result, and reported what the interpreter said. No mental arithmetic, no approximation, no fabricated significant digits.

Finally, the more interesting case: a prompt that requires both tools in sequence. The model has to inspect the folder and compute something about what it finds:

    Look at the files in the current folder and tell me the total size in kilobytes, rounded to two decimal places.

Output:

    SYSTEM
      ○ Tool: execute_python_code...................... LOADED
      ○ Tool: list_directory_contents.................. LOADED
      ○ Workspace: /Users/matt/projects/gemma_agent..... SANDBOXED

    PROMPT
      Look at the files in the current folder and tell me the total size in kilobytes, rounded to two decimal places.

    EXECUTION
      ● Querying model...

    TOOL EXECUTION
      ┌── Calling: list_directory_contents
      │   ├─ Args: path='.'
      │   └─ Result: Contents of '.' (5 item(s)):
      │        FILE README.md (412 bytes)
      │        FILE csv_cleaner.py (1834 bytes)
      │        FILE main.py (10786 bytes)
      │        FILE notes.txt (88 bytes)
      │        FILE sales_report.py (2210 bytes)
      │
      └── Calling: execute_python_code
          ├─ Args: code='sizes = [412, 1834, 10786, 88, 2210]\nprint(round(sum(siz...'
          └─ Result: Output: 15.33

    EXECUTION
      ● Synthesizing results...

    RESPONSE
      The five files in the current folder total 15.33 KB.

Two tools, in the right order, with the output of one feeding the argument of the other, all produced by a 2-billion-parameter model running on a laptop with no GPU. The filesystem tool grounds the model in what is actually there; the interpreter tool grounds the answer in what is actually true. The model contributes the part it is genuinely good at, which is deciding which question to ask of which tool.

It is worth poking at the safety guards too, just to confirm they hold. Asking the model "list the contents of /etc" produces the expected denial message in the tool result, which the model then reports back gracefully rather than fabricating a directory listing. Asking it to run open('/etc/passwd').read() inside the interpreter produces a NameError, since open is not in the whitelisted builtins. Both failures degrade into useful error strings instead of silent compromises, which is exactly what you want at this layer.

Conclusion

The earlier tutorial (https://machinelearningmastery.com/how-to-implement-tool-calling-with-gemma-4-and-python/) showed that Gemma 4 can reach across the internet on your behalf. This one shows it can reach into the machine you are sitting at, carefully, when you have built the carefulness in. Once you have a working tool-calling loop, the interesting question stops being "can the model call a function" and starts being "what should I let it touch."
A filesystem-aware tool and a code-execution tool together get you most of the way to something that genuinely earns the term agent: it can observe its environment, decide what calculation matters, and run that calculation deterministically rather than guessing.

The pattern generalizes from there. Database queries, shell commands, git operations, document parsing: each one of these is the same JSON schema, the same dispatch table, the same two-pass synthesis, with whatever safety perimeter is appropriate for the blast radius of the underlying call.

Build the perimeter first. Then hand the model the keys to whatever sits inside it.

Matthew Mayo (@mattmayo13, https://twitter.com/mattmayo13) holds a master's degree in computer science and a graduate diploma in data mining. As managing editor of KDnuggets (https://www.kdnuggets.com/) & Statology (https://www.statology.org/), and contributing editor at Machine Learning Mastery (https://machinelearningmastery.com/), Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.