{"slug": "how-i-replaced-gemini-with-a-self-hosted-llm-for-two-production-apps", "title": "How I Replaced Gemini with a Self-Hosted LLM for Two Production Apps", "summary": "A developer replaced Google's Gemini 3 Flash with a self-hosted Qwen model via Ollama for two production applications, citing cost, control, and infrastructure economics. The setup uses a Mac mini as primary through a Cloudflare tunnel and an Oracle Cloud VM as fallback. The move was driven by the desire to treat AI as shared infrastructure rather than a metered API.", "body_md": "A while back I wrote about [my terminal-inspired portfolio](https://dev.to/smngvlkz/a-calm-terminal-inspired-portfolio-focused-on-shipped-products-ga8) and the products it indexes. Two of those products lean on a language model: the portfolio terminal at [smngvlkz.com](https://smngvlkz.com) that you can ask questions, and [PayChasers](https://paychasers.com), which generates optional payment follow-up emails. Both started on Google's Gemini 3 Flash. Both now run on a model I host myself, with a fallback chain that keeps them alive when my hardware is not.\n\nThis is the story of that move. The experiment that started it, why I committed to it, what the architecture looks like, the night it broke, and the parts I still have not solved.\n\nWhen Qwen 3.5 was announced, it made me curious about how far open models have actually come. Instead of reading benchmarks, I tested it the way I like to learn things, by running it.\n\nIt began as a small experiment on my base Mac mini. I pulled Qwen through [Ollama](https://ollama.com) just to see how capable the model would be running directly on a local machine. The results were far better than I expected. Good enough that I stopped thinking of it as a toy and started thinking about production.\n\nGemini 3 Flash worked. The integration was a few lines and the quality was good. So this was not a \"the API is bad\" story. 
It was three smaller pulls that added up.\n\nThe first was cost shape. PayChasers generates optional email drafts on demand, and every preview is a few thousand tokens of system prompt plus output. That is fine at zero users and a slow leak at volume. The marginal cost of an inference I run on a machine I already own is electricity.\n\nThe second was control and privacy. I wanted to choose the model, pin it, and change the prompt contract without a provider deprecating something underneath me. I also did not love sending client names and payment context to a third party when I did not have to.\n\nThe third was the economics of treating AI as infrastructure rather than a metered API. Once the model runs on hardware I control, it stops being a per-call expense and becomes shared infrastructure that multiple applications can use. The same inference server now powers two different products. That reframing is the whole point.\n\nThe original plan was to host the model on Oracle Cloud using one of their free Ampere ARM instances in the Johannesburg region. If you have ever tried to get one, you know the struggle. Free tier ARM capacity is brutally limited, and after more than 200 automated retry attempts across two days, I still could not get one.\n\nSo I pivoted. I wrote a lightweight reverse proxy, set up a Cloudflare Tunnel on one of my domains, and routed production traffic to the model running on my Mac at home. No ports opened on my home network, no static IP, just a tunnel from Cloudflare's edge to the machine on my desk.\n\nIt was meant to be temporary. The Oracle instance eventually did come through, but by then the home setup was working well, so I did not throw it away. Instead I kept the Mac mini as the primary and gave Oracle a different job, the always on backup. More on that in a moment.\n\nThis was a small full-circle moment. 
The Linux and infrastructure fundamentals I picked up during my bootcamp days and years of self teaching showed up in a real production context. Provisioning tunnels, configuring DNS, writing a proxy service, setting up persistent services. All of it coming together for something real.\n\nOne deliberate decision was to keep the infrastructure simple. There are a lot of frameworks and agent systems appearing in the space right now. I focused on straightforward tooling that solved the problems I actually had.\n\nThe Mac mini, exposed through Cloudflare tunnel, is the **primary**. It is fast but it is not always on, because it is a machine in my home. The Oracle Cloud VM is the **fallback**. It is slower and smaller, but it stays up around the clock.\n\nEvery app talks to a thin client that knows about both, tries the fast one first, and silently falls back to the reliable one.\n\n```\nVercel app\n   |\n   v\n[ primary: Mac mini via Cloudflare tunnel ]  --fail/timeout-->  [ fallback: Oracle Cloud VM ]\n        fast, not always on                                          slow, always on\n```\n\nThis is the whole idea in one function. Hit the primary with a timeout. 
If anything goes wrong, whether a bad status, a timeout, or a dropped tunnel, fall through to the fallback.\n\n``` js\nconst PRIMARY_URL = process.env.OLLAMA_PRIMARY_URL || \"http://localhost:11434\";\nconst FALLBACK_URL = process.env.OLLAMA_FALLBACK_URL || PRIMARY_URL;\n\nasync function fetchWithFallback(path: string, body: object): Promise<Response> {\n    try {\n        const res = await fetch(`${PRIMARY_URL}${path}`, {\n            method: \"POST\",\n            headers: { \"Content-Type\": \"application/json\" },\n            body: JSON.stringify(body),\n            signal: AbortSignal.timeout(15000),\n        });\n        if (!res.ok) throw new Error(`Primary failed (${res.status})`);\n        return res;\n    } catch {\n        const res = await fetch(`${FALLBACK_URL}${path}`, {\n            method: \"POST\",\n            headers: { \"Content-Type\": \"application/json\" },\n            body: JSON.stringify(body),\n        });\n        if (!res.ok) {\n            const text = await res.text().catch(() => \"Unknown error\");\n            throw new Error(`Ollama request failed (${res.status}): ${text}`);\n        }\n        return res;\n    }\n}\n```\n\nA few small choices that matter more than they look:\n\n- The fallback `fetch` has no timeout, which can hang if Oracle is reachable but wedged. A long timeout would be the more defensible version of the same idea, and I have not added one yet.\n- The bare `catch` swallows why the primary failed, no log, no signal. Fine for failing over, bad for diagnosing, and something I would tighten before I called this production hardened.\n\nRunning your own models means you also get to decide which model serves which request. I do a very simple version of routing based on how many requests are in flight.\n\n``` js\nlet activeRequests = 0;\n\nfunction selectModel(): string {\n    // 1 request: best quality. 2+: lighter model that handles concurrency.\n    return activeRequests > 1 ? 
FALLBACK_MODEL : PRIMARY_MODEL;\n}\n```\n\nOn the portfolio, a single visitor gets `qwen3.5:latest`, the better model. The moment two requests overlap, new requests drop to `qwen2.5-coder:7b`, which is lighter and keeps latency sane under concurrency. This is not sophisticated. It is one counter and a ternary. But it is the real cost and quality tradeoff in miniature, and on a single base Mac mini it is the difference between graceful and failing over.\n\nI also pass two Ollama options that earn their keep:\n\n- `keep_alive: -1` keeps the model resident in memory so the next request does not pay the cold load.\n- `think: false` turns off the reasoning tokens, because for a portfolio terminal and an email draft I want the answer, not the monologue.\n\nThe cheapest inference is the one you never run. Previously my portfolio terminal used Gemini 3 Flash for natural language queries while common commands were handled locally without AI. I kept that split when I moved the natural language layer onto my own infrastructure.\n\n``` js\nconst lowerQuery = query.toLowerCase().trim();\n\nif (lowerQuery === \"help\") { /* return static command list */ }\nif (lowerQuery === \"list all\") { /* return products + systems from data */ }\nif (lowerQuery === \"show activity\") { /* return GitHub/GitLab stats */ }\n\nconst showMatch = lowerQuery.match(/^show\\s+([\\w-]+)\\s+(\\w+)$/);\nif (showMatch) { /* answer straight from structured data */ }\n\n// only open-ended natural language falls through to the model\n```\n\n`help`, `list`, `show`, and `explain` are answered straight from the typed data. Only genuinely open-ended questions stream from the model. It is faster, it is free, and it is more reliable than asking a 7B model to format a list it could get wrong.\n\nFor the open-ended path, the portfolio streams tokens over server-sent events. 
Ollama returns newline-delimited JSON, so the route reads the body, splits on newlines, and re-emits each token as an SSE frame.\n\n``` js\nconst reader = response.body!.getReader();\nconst decoder = new TextDecoder();\nlet buffer = \"\";\n\nwhile (true) {\n    const { done, value } = await reader.read();\n    if (done) break;\n    buffer += decoder.decode(value, { stream: true });\n    const lines = buffer.split(\"\\n\");\n    buffer = lines.pop() || \"\";\n    for (const line of lines) {\n        if (!line.trim()) continue;\n        const chunk = JSON.parse(line);\n        const token = chunk.message?.content || \"\";\n        if (token) {\n            controller.enqueue(encoder.encode(\n                `data: ${JSON.stringify({ type: \"token\", content: token })}\\n\\n`\n            ));\n        }\n    }\n}\n```\n\nBoth products stream responses token by token and run entirely on infrastructure I control.\n\nPayChasers is where the prompt work actually lives, because the output is not a chat bubble, it is an email that gets sent to someone's client. Two things make a self-hosted 7B model reliable enough for that.\n\nFirst, the model never writes real values. It writes placeholders, and the app fills them in. 
This keeps the model from hallucinating an amount or a name.\n\n```\nCRITICAL: You MUST use these exact placeholder variables instead of real values:\n- {clientName} for the recipient's name\n- {dueDate} for the due date\n- {amount} for the amount owed\n- {daysOverdue} for the number of days overdue\n\nFor example: \"Hey {clientName},\" NOT \"Hey John,\".\nReturn ONLY valid JSON.\n```\n\nSecond, the tone escalates with how late the payment is, decided in code, not left to the model's mood.\n\n``` js\nfunction determineTone(daysOverdue: number) {\n    if (daysOverdue >= 14) return \"urgent\";\n    if (daysOverdue >= 7) return \"firm\";\n    return \"friendly\";\n}\n```\n\nAnd because a local model will occasionally wrap its JSON in a code fence or stray `<think>` block no matter how firmly you ask, the parser is defensive rather than trusting.\n\n``` js\nfunction extractJson(text: string) {\n    const cleaned = text.replace(/<think>[\\s\\S]*?<\\/think>/g, \"\").trim();\n    try { return JSON.parse(cleaned); } catch {}\n\n    const fence = cleaned.match(/```(?:json)?\\s*([\\s\\S]*?)```/);\n    if (fence) { try { return JSON.parse(fence[1].trim()); } catch {} }\n\n    const first = cleaned.indexOf(\"{\");\n    const last = cleaned.lastIndexOf(\"}\");\n    if (first !== -1 && last > first) {\n        try { return JSON.parse(cleaned.slice(first, last + 1)); } catch {}\n    }\n}\n```\n\nSelf-hosting a smaller model means you trade some of the provider's polish for parsing you own. That is a fair trade when the upside is control and cost.\n\nThen I learned the lesson that every self-hoster learns eventually.\n\nThere was a small power outage one night around 20:00. The Mac mini, my primary inference node, switched off, and it never came back on. I only realised the next morning.\n\nPayChasers failed over to the Oracle backup automatically, exactly as it should have. But the floating terminal in my portfolio had no failover, so it just sat there dead all night. 
Anyone who was bored enough to try and poke at my portfolio that night got nothing.\n\nTwo lessons came out of that morning:\n\n- The portfolio needed the same `fetchWithFallback` client that PayChasers already had.\n- Self-hosting your own AI is great until you are the one on call at 8am on a Saturday, and there is no one else to escalate to, because it is your own thing.\n\nSo I built the monitoring I should have had first. PayChasers runs a small cron that health-checks both Ollama endpoints and emails me, but only on a **state transition**, up to down or down to up. It keeps the last known state in Upstash Redis so it does not spam me every 5 minutes while the mini is asleep.\n\n``` js\nconst ENDPOINTS = [\n    { name: \"primary-mac\", url: process.env.OLLAMA_PRIMARY_URL },\n    { name: \"fallback-oracle\", url: process.env.OLLAMA_FALLBACK_URL },\n];\n\n// hit /api/tags on each, compare to stored state in Redis,\n// send a Resend email only when ok flips. Auth via CRON_SECRET.\n```\n\nNow when the mini goes offline, traffic quietly shifts to Oracle and I get exactly one email telling me so. That is the entire operations story, and that is the amount of operations story I want for a side project.\n\nI want to be honest about the edges, because the architecture above is the easy part.\n\nMy evaluation is still vibes. I read the generated emails, decide they look good, and ship. I do not have an eval harness scoring tone, placeholder correctness, or JSON validity across a fixed set of cases. I should. When I claim qwen3.5 is \"better\" than qwen2.5-coder for a request, that is intuition, not a benchmark.\n\nThe irony is that the plumbing is already there. PayChasers runs PostHog for the product funnel: signups, chases created, upgrades. Capturing AI events would be trivial. A `draft_generated`, `draft_accepted`, `draft_edited`, `draft_regenerated` funnel would tell me, with real users, how often a generated email ships untouched versus gets rewritten. 
That acceptance rate is a real quality signal, and it is the cheapest first step from vibes towards measurement. I just have not wired it yet.\n\nMy model selection is instinct, not measurement. I picked these Qwen models because they ran well on my hardware and read well in practice. A systematic version would measure latency, quality, and cost per model and route on data.\n\nAnd I have not touched retrieval. Both apps stuff their full context into the system prompt, which is fine at this size and would fall apart the moment the data outgrew the window. There is no RAG here, and I have not yet had to reach for it.\n\nI am pointing at these on purpose. The move off Gemini taught me serving, the cost and reliability tradeoff, basic routing, and prompt constraining by doing them. The next layer, real evaluation and measured model choice, is the part I am learning now.\n\nOpen models have come a long way. It is becoming genuinely practical to run useful AI systems on relatively small infrastructure. No GPU cluster required. What started as a small experiment on a base mini is now live for real users across two products, on infrastructure I own.\n\nThis is not a finished system. 
It is a snapshot of how I run a model I control today, and a map of what I am building next.", "url": "https://wpnews.pro/news/how-i-replaced-gemini-with-a-self-hosted-llm-for-two-production-apps", "canonical_source": "https://dev.to/smngvlkz/how-i-replaced-gemini-with-a-self-hosted-llm-for-two-production-apps-3069", "published_at": "2026-06-27 13:56:38+00:00", "updated_at": "2026-06-27 14:03:46.158146+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools"], "entities": ["Google", "Gemini", "Qwen", "Ollama", "Cloudflare", "Oracle Cloud", "Mac mini", "PayChasers"], "alternates": {"html": "https://wpnews.pro/news/how-i-replaced-gemini-with-a-self-hosted-llm-for-two-production-apps", "markdown": "https://wpnews.pro/news/how-i-replaced-gemini-with-a-self-hosted-llm-for-two-production-apps.md", "text": "https://wpnews.pro/news/how-i-replaced-gemini-with-a-self-hosted-llm-for-two-production-apps.txt", "jsonld": "https://wpnews.pro/news/how-i-replaced-gemini-with-a-self-hosted-llm-for-two-production-apps.jsonld"}}