I Built an AI That Decides Which AI to Talk To — Running 24/7 From My Living Room

Based on the article, the author built an autonomous AI agent called OpenClaw that runs 24/7 on a Raspberry Pi and manages tasks like research, coding, and document editing. To optimize costs and performance, the author created a lightweight Python router that automatically directs simple requests to a local, free LLM on a Mac Mini and complex reasoning tasks to paid frontier models. The system uses Google's open-source AgentGateway to unify the endpoints, handle authentication, and provide observability without the client agent knowing which backend model is used.

Last Saturday when I waked up, my AI agent reviewed 14 restaurant ratings in Indiranagar, updated a shared Google Sheet, signed a 20-page PDF I'd been ignoring for a week, and wrote a bash script to clean up my server logs. I didn't ask it to do any of that. It just... does things now. Meet OpenClaw — my long-running autonomous agent that lives on a Raspberry Pi, plugged into Discord, running 24/7. It manages my memory, handles research, writes code, edits documents, finds the best weekend spots in Bangalore by scraping live ratings — basically, it runs half my life on autopilot. But a few weeks ago, I noticed something that bothered me. I asked it: "Write a Python script to parse JSON logs." Simple coding task. It sent that request to a cloud API, waited 3 seconds, burned tokens I paid for, and came back with an answer — when I had a perfectly capable local LLM sitting idle on my Mac Mini, three feet away. Then I asked: "Think step by step about the trade-offs between event-driven vs polling architecture for my notification system." That's a hard reasoning question. I want that going to a frontier model. That's worth the tokens. Same agent. Same endpoint. Completely different needs. And that's when a stupid idea hit me: What if the system could figure out which brain to use — before the request even reaches a model? Turns out, it's not stupid at all. And it took me a weekend, a Raspberry Pi, a Mac Mini, 50 lines of Python, and an open-source gateway to build it. Here's how. Here's what's running in my living room: Raspberry Pi → Runs OpenClaw, my autonomous agent. It takes input from Discord, manages context, memory, and orchestrates everything. Mac Mini → The brain farm. Runs three things: Ollama with qwen2.5-coder:7b — a local coding model that never leaves my network AgentGateway — an open-source AI gateway from Google that handles routing, auth, observability A lightweight Python router — the "intent classifier" I wrote in ~50 lines of code The magic? OpenClaw doesn't know any of this is happening. It just sends a request to one endpoint. Behind the scenes, the system figures out the rest. Three models. Three price points. One unified endpoint. OpenClaw just hits http://192.168.1.15:1234/v1/chat/completions and forgets about it. I evaluated a few options — raw Envoy, Nginx with Lua scripting, even building a full proxy from scratch. But AgentGateway stood out for a few reasons: What it gives you out of the box: Protocol translation — It speaks OpenAI-compatible API on the frontend, but can talk to Gemini, Vertex AI, Bedrock, Ollama, and more on the backend. I don't write a single line of provider-specific code. Backend authentication — API keys are managed at the gateway level. OpenClaw never sees or stores any API key. I just set backendAuth: key: $GEMINI API KEY in the config and it handles the rest. Model aliasing — OpenClaw sends model: "inteli-llm" in every request. AgentGateway silently translates that to qwen2.5-coder:7b, gpt-4o, or gemini-2.5-flash depending on which route matched. The client has no idea. Observability — Every request gets logged with provider name, model, token counts, and latency. I can see exactly how many tokens are going to OpenAI vs staying local. Prompt guards & rate limiting — Built-in regex-based PII masking, webhook-based content moderation, and rate limiting. Enterprise-grade features I get for free. Weighted load balancing & failover — If Ollama crashes it happens , I can configure automatic failover to a cloud model. No downtime. What it doesn't do yet : Content-aware routing. AgentGateway routes based on path, headers, and methods — which is the right design for a gateway. It doesn't peek into your request body to decide where to send it. That's a feature, not a bug — gateways should be fast and protocol-level, not parsing JSON payloads. But I needed content-aware routing. So instead of searching for other tool, I extended it. I wrote a tiny FastAPI proxy that sits in front of AgentGateway. Here's what it does: code , python , script , function , bug ? → codingthink , analyze , reasoning , deduce ? Or prompt 400 chars? → reasoningcoding keywords = "code", "python", "javascript", "bash", "script", "function", "bug" reasoning keywords = "think", "analyze", "explain in detail", "reasoning", "logic", "deduce" if any k in prompt lower for k in coding keywords : intent = "coding" elif len prompt 400 or any k in prompt lower for k in reasoning keywords : intent = "reasoning" else: intent = "simple" Here's what this setup actually saves me: Before this setup, every single request was going to a cloud API. Now, roughly 60-70% of my queries stay local — coding questions, quick lookups, simple formatting tasks. They're fast, free, and private. The expensive reasoning model only gets called when I genuinely need it. And the mid-tier Gemini handles everything in between. My monthly API bill dropped significantly, and the local responses are actually faster. 1. Header-based routing over path-based routing Initially, I was going to use URL paths /coding , /reasoning , /simple and strip them with URL rewriting. But header injection is cleaner — the original request path stays intact, and AgentGateway's header matching is first-class. 2. Classification at the proxy, not the gateway I could have tried to use AgentGateway's CEL expressions or ExtProc policies for classification. But those run after backend selection, not before. Keeping classification in a separate lightweight layer means I can swap algorithms without touching my gateway config. 3. Keyword heuristics over ML classifiers Could I use a small classifier model or even RouteLLM for smarter routing? Absolutely. But for a homelab, keyword matching is: 4. One unified model name OpenClaw sends model: "inteli-llm" for everything. AgentGateway's modelAliases feature translates it per-route. This means I can swap out backend models without touching a single line of OpenClaw's config. Last week it was gemini-1.5-flash , this week it's gemini-2.5-flash . OpenClaw never knew. Smarter classification — Maybe a tiny local classifier model, or even using the first few tokens of a response to reclassify and retry on a better model. Metrics dashboard — AgentGateway already emits OpenTelemetry traces. I want to hook up a Grafana dashboard to see which models are handling what, with latency and token breakdowns. Failover chains — If Ollama is under heavy load, automatically fall back to Gemini for coding tasks. AgentGateway supports priority groups for this. More agents — OpenClaw is just the beginning. I want to run specialized agents for different domains, all routing through the same gateway. You don't need a Kubernetes cluster or a $10K GPU server to build a multi-model AI system. A Raspberry Pi, a Mac Mini, an open-source gateway, and 50 lines of Python got me: ✅ An always-on autonomous agent ✅Intelligent routing ✅across 3 different LLMs ✅Local-first for privacy and speed ✅Cloud when I need the horsepower ✅Zero API keys exposed to the client ✅A monthly bill I actually don't mind paying The best part? The entire config is a single YAML file and a single Python script. No Docker. No Kubernetes. No Terraform. Just two processes on a Mac Mini and an agent on a Pi. Sometimes the best infrastructure is the one you can explain in a napkin sketch. If you're building something similar or want to see the config files, drop a comment — happy to share the full setup.