How I Replaced Gemini with a Self-Hosted LLM for Two Production Apps

A developer replaced Google's Gemini 3 Flash with a self-hosted Qwen model via Ollama for two production applications, citing cost, control, and infrastructure economics. The setup uses a Mac mini as primary through a Cloudflare tunnel and an Oracle Cloud VM as fallback. The move was driven by the desire to treat AI as shared infrastructure rather than a metered API.

A while back I wrote about my terminal-inspired portfolio (https://dev.to/smngvlkz/a-calm-terminal-inspired-portfolio-focused-on-shipped-products-ga8) and the products it indexes. Two of those products lean on a language model: the portfolio terminal at smngvlkz.com (https://smngvlkz.com), which you can ask questions, and PayChasers (https://paychasers.com), which generates optional payment follow-up emails. Both started on Google's Gemini 3 Flash. Both now run on a model I host myself, with a fallback chain that keeps them alive when my hardware is not.

This is the story of that move: the experiment that started it, why I committed to it, what the architecture looks like, the night it broke, and the parts I still have not solved.

When Qwen 3.5 was announced, it made me curious about how far open models have actually come. Instead of reading benchmarks, I tested it the way I like to learn things, by running it. It began as a small experiment on my base Mac mini. I pulled Qwen through Ollama (https://ollama.com) just to see how capable the model would be running directly on a local machine. The results were far better than I expected. Good enough that I stopped thinking of it as a toy and started thinking about production.

Gemini 3 Flash worked. The integration was a few lines and the quality was good. So this was not a "the API is bad" story. It was three smaller pulls that added up.

The first was cost shape. PayChasers generates optional email drafts on demand, and every preview is a few thousand tokens of system prompt plus output. That is fine at zero users and a slow leak at volume. The marginal cost of an inference I run on a machine I already own is electricity.

The second was control and privacy. I wanted to choose the model, pin it, and change the prompt contract without a provider deprecating something underneath me. I also did not love sending client names and payment context to a third party when I did not have to.
The third was the economics of treating AI as infrastructure rather than a metered API. Once the model runs on hardware I control, it stops being a per-call expense and becomes shared infrastructure that multiple applications can use. The same inference server now powers two different products. That reframing is the whole point.

The original plan was to host the model on Oracle Cloud using one of their free Ampere ARM instances in the Johannesburg region. If you have ever tried to get one, you know the struggle. Free tier ARM capacity is brutally limited, and after more than 200 automated retry attempts across two days, I still could not get one.

So I pivoted. I wrote a lightweight reverse proxy, set up a Cloudflare Tunnel on one of my domains, and routed production traffic to the model running on my Mac at home. No ports opened on my home network, no static IP, just a tunnel from Cloudflare's edge to the machine on my desk. It was meant to be temporary.

The Oracle instance eventually did come through, but by then the home setup was working well, so I did not throw it away. Instead I kept the Mac mini as the primary and gave Oracle a different job, the always-on backup. More on that in a moment.

This was a small full-circle moment. The Linux and infrastructure fundamentals I picked up during my bootcamp days and years of self-teaching showed up in a real production context. Provisioning tunnels, configuring DNS, writing a proxy service, setting up persistent services. All of it coming together for something real.

One deliberate decision was to keep the infrastructure simple. There are a lot of frameworks and agent systems appearing in the space right now. I focused on straightforward tooling that solved the problems I actually had.

The Mac mini, exposed through a Cloudflare tunnel, is the primary. It is fast but it is not always on, because it is a machine in my home. The Oracle Cloud VM is the fallback. It is slower and smaller, but it stays up around the clock.
Every app talks to a thin client that knows about both, tries the fast one first, and silently falls back to the reliable one.

```
Vercel app
    |
    v
primary: Mac mini via Cloudflare tunnel  --fail/timeout-->  fallback: Oracle Cloud VM
fast, not always on                                         slow, always on
```

This is the whole idea in one function. Hit the primary with a timeout. If anything goes wrong, the status, the timeout, a dropped tunnel, fall through to the fallback.

```js
const PRIMARY_URL = process.env.OLLAMA_PRIMARY_URL || "http://localhost:11434";
const FALLBACK_URL = process.env.OLLAMA_FALLBACK_URL || PRIMARY_URL;

async function fetchWithFallback(path: string, body: object): Promise<Response> {
  try {
    // Primary: give up after 15 seconds so a sleeping Mac does not stall the request.
    const res = await fetch(`${PRIMARY_URL}${path}`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
      signal: AbortSignal.timeout(15000),
    });
    if (!res.ok) throw new Error(`Primary failed (${res.status})`);
    return res;
  } catch {
    // Any primary failure (bad status, timeout, dropped tunnel) lands here.
    const res = await fetch(`${FALLBACK_URL}${path}`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    });
    if (!res.ok) {
      const text = await res.text().catch(() => "Unknown error");
      throw new Error(`Ollama request failed (${res.status}): ${text}`);
    }
    return res;
  }
}
```

A few small choices that matter more than they look:

- The fallback request is a bare fetch with no timeout, which can hang if Oracle is reachable but wedged. A long timeout would be the more defensible version of the same idea, and I have not added one yet.
- The bare catch swallows why the primary failed: no log, no signal. Fine for failing over, bad for diagnosing, and something I would tighten before I called this production hardened.

Running your own models means you also get to decide which model serves which request. I do a very simple version of routing based on how many requests are in flight.

```js
let activeRequests = 0;

function selectModel(): string {
  // 1 request: best quality. 2+: lighter model that handles concurrency.
  return activeRequests > 1 ? FALLBACK_MODEL : PRIMARY_MODEL;
}
```

On the portfolio, a single visitor gets qwen3.5:latest, the better model. The moment two requests overlap, new requests drop to qwen2.5-coder:7b, which is lighter and keeps latency sane under concurrency. This is not sophisticated. It is one counter and a ternary. But it is the real cost and quality tradeoff in miniature, and on a single base Mac mini it is the difference between graceful and failing over.

I also pass two Ollama options that earn their keep:

- keep_alive: -1 keeps the model resident in memory so the next request does not pay the cold load.
- think: false turns off the reasoning tokens, because for a portfolio terminal and an email draft I want the answer, not the monologue.

The cheapest inference is the one you never run. Previously my portfolio terminal used Gemini 3 Flash for natural language queries while common commands were handled locally without AI. I kept that split when I moved the natural language layer onto my own infrastructure.

```js
const lowerQuery = query.toLowerCase().trim();

if (lowerQuery === "help") { /* return static command list */ }
if (lowerQuery === "list all") { /* return products + systems from data */ }
if (lowerQuery === "show activity") { /* return GitHub/GitLab stats */ }

const showMatch = lowerQuery.match(/^show\s+([\w-]+)\s+(\w+)$/);
if (showMatch) { /* answer straight from structured data */ }

// only open-ended natural language falls through to the model
```

help, list, show, and explain are answered straight from the typed data. Only genuinely open-ended questions stream from the model. It is faster, it is free, and it is more reliable than asking a 7B model to format a list it could get wrong.

For the open-ended path, the portfolio streams tokens over server-sent events. Ollama returns newline-delimited JSON, so the route reads the body, splits on newlines, and re-emits each token as an SSE frame.
```js
const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() || ""; // keep the trailing partial line for the next read

  for (const line of lines) {
    if (!line.trim()) continue;
    const chunk = JSON.parse(line);
    const token = chunk.message?.content || "";
    if (token) {
      controller.enqueue(
        encoder.encode(`data: ${JSON.stringify({ type: "token", content: token })}\n\n`)
      );
    }
  }
}
```

Both products stream responses token by token and run entirely on infrastructure I control.

PayChasers is where the prompt work actually lives, because the output is not a chat bubble, it is an email that gets sent to someone's client. Two things make a self-hosted 7B model reliable enough for that.

First, the model never writes real values. It writes placeholders, and the app fills them in. This keeps the model from hallucinating an amount or a name.

```
CRITICAL: You MUST use these exact placeholder variables instead of real values:
- {clientName} for the recipient's name
- {dueDate} for the due date
- {amount} for the amount owed
- {daysOverdue} for the number of days overdue
For example: "Hey {clientName}," NOT "Hey John,".
Return ONLY valid JSON.
```

Second, the tone escalates with how late the payment is, decided in code, not left to the model's mood.

```js
function determineTone(daysOverdue: number) {
  if (daysOverdue >= 14) return "urgent";
  if (daysOverdue >= 7) return "firm";
  return "friendly";
}
```

And because a local model will occasionally wrap its JSON in a code fence or a stray <think> block no matter how firmly you ask, the parser is defensive rather than trusting.

```js
function extractJson(text: string) {
  // 1. Drop any reasoning block, then try the text as-is.
  const cleaned = text.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
  try { return JSON.parse(cleaned); } catch {}

  // 2. Try the contents of a fenced code block.
  const fence = cleaned.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fence) { try { return JSON.parse(fence[1].trim()); } catch {} }

  // 3. Last resort: everything between the first "{" and the last "}".
  const first = cleaned.indexOf("{");
  const last = cleaned.lastIndexOf("}");
  if (first !== -1 && last > first) {
    try { return JSON.parse(cleaned.slice(first, last + 1)); } catch {}
  }
  // Falls through and returns undefined when nothing parses.
}
```

Self-hosting a smaller model means you trade some of the provider's polish for parsing you own. That is a fair trade when the upside is control and cost.

Then I learned the lesson that every self-hoster learns eventually. There was a small power outage one night around 20:00. The Mac mini, my primary inference node, switched off, and it never came back on. I only realised the next morning.

PayChasers failed over to the Oracle backup automatically, exactly as it should have. But the floating terminal in my portfolio had no failover, so it just sat there dead all night. Anyone who was bored enough to try and poke at my portfolio that night got nothing.

Two lessons came out of that morning:

- The portfolio terminal needed the same fetchWithFallback client that PayChasers already had.
- Self-hosting your own AI is great until you are the one on call at 8am on a Saturday, and there is no one else to escalate to, because it is your own thing.

So I built the monitoring I should have had first. PayChasers runs a small cron that health-checks both Ollama endpoints and emails me, but only on state transition, up to down or down to up. It keeps the last known state in Upstash Redis so it does not spam me every 5 minutes while the mini is asleep.

```js
const ENDPOINTS = [
  { name: "primary-mac", url: process.env.OLLAMA_PRIMARY_URL },
  { name: "fallback-oracle", url: process.env.OLLAMA_FALLBACK_URL },
];
// hit /api/tags on each, compare to stored state in Redis,
// send a Resend email only when ok flips. Auth via CRON_SECRET.
```

Now when the mini goes offline, traffic quietly shifts to Oracle and I get exactly one email telling me so. That is the entire operations story, and that is the amount of operations story I want for a side project.
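The transition-only alerting can be sketched as one pure function plus a small loop. This is a minimal sketch under stated assumptions, not the actual PayChasers cron: `StateStore`, `MemoryStore`, `detectTransition`, `runHealthCheck`, and the injected `probe` and `notify` callbacks are all hypothetical names. An in-memory object stands in for Upstash Redis, and a callback stands in for the Resend email. Alerting on a first observation only when the endpoint is already down is also my assumption.

```typescript
// Hypothetical sketch: alert only when an endpoint's health flips.
type Health = "up" | "down";

interface Endpoint {
  name: string;
  url: string;
}

// Stand-in for the Redis-backed store; any async get/set pair would fit.
interface StateStore {
  get(key: string): Promise<Health | undefined>;
  set(key: string, value: Health): Promise<void>;
}

class MemoryStore implements StateStore {
  private data: Record<string, Health> = {};
  async get(key: string): Promise<Health | undefined> {
    return this.data[key];
  }
  async set(key: string, value: Health): Promise<void> {
    this.data[key] = value;
  }
}

// Returns a message when the state changed, or null when there is nothing to report.
// Assumption: the first observation of an endpoint alerts only if it is already down.
function detectTransition(
  name: string,
  previous: Health | undefined,
  current: Health,
): string | null {
  if (previous === undefined) return current === "down" ? `${name} is down` : null;
  if (previous === current) return null;
  return `${name} went ${previous} -> ${current}`;
}

async function runHealthCheck(
  endpoints: Endpoint[],
  store: StateStore,
  probe: (url: string) => Promise<boolean>, // e.g. a request to /api/tags with a timeout
  notify: (message: string) => Promise<void>, // e.g. send a single email
): Promise<string[]> {
  const sent: string[] = [];
  for (const endpoint of endpoints) {
    // A probe that throws counts as down.
    const ok = await probe(endpoint.url).catch(() => false);
    const current: Health = ok ? "up" : "down";
    const previous = await store.get(endpoint.name);
    const message = detectTransition(endpoint.name, previous, current);
    await store.set(endpoint.name, current);
    if (message !== null) {
      await notify(message);
      sent.push(message);
    }
  }
  return sent;
}
```

Keeping `probe` and `notify` injected means the flip logic can be exercised without a network or an email provider. Repeated runs while an endpoint stays down return an empty list, which is the "exactly one email" behaviour.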
I want to be honest about the edges, because the architecture above is the easy part.

My evaluation is still vibes. I read the generated emails, decide they look good, and ship. I do not have an eval harness scoring tone, placeholder correctness, or JSON validity across a fixed set of cases. I should. When I claim qwen3.5 is "better" than qwen2.5-coder for a request, that is intuition, not a benchmark.

The irony is that the plumbing is already there. PayChasers runs PostHog for the product funnel: signups, chases created, upgrades. Capturing AI events would be trivial. A draft_generated, draft_accepted, draft_edited, draft_regenerated funnel would tell me, with real users, how often a generated email ships untouched versus gets rewritten. That acceptance rate is a real quality signal, and it is the cheapest first step from vibes towards measurement. I just have not wired it yet.

My model selection is instinct, not measurement. I picked these Qwen models because they ran well on my hardware and read well in practice. A systematic version would measure latency, quality, and cost per model and route on data.

And I have not touched retrieval. Both apps stuff their full context into the system prompt, which is fine at this size and would fall apart the moment the data outgrew the window. There is no RAG here, and I have not yet had to reach for it.

I am pointing at these on purpose. The move off Gemini taught me serving, the cost and reliability tradeoff, basic routing, and prompt constraining by doing them. The next layer, real evaluation and measured model choice, is the part I am learning now.

Open models have come a long way. It is becoming genuinely practical to run useful AI systems on relatively small infrastructure. No GPU cluster required. What started as a small experiment on a base mini is now live for real users across two products, on infrastructure I own.

This is not a finished system. It is a snapshot of how I run a model I control today, and a map of what I am building next.