How I Replaced Gemini with a Self-Hosted LLM for Two Production Apps

A developer replaced Google's Gemini 3 Flash with a self-hosted Qwen model via Ollama for two production applications, citing cost, control, and infrastructure economics. The setup uses a Mac mini as primary through a Cloudflare tunnel and an Oracle Cloud VM as fallback. The move was driven by the desire to treat AI as shared infrastructure rather than a metered API.

A while back I wrote about my terminal-inspired portfolio (https://dev.to/smngvlkz/a-calm-terminal-inspired-portfolio-focused-on-shipped-products-ga8) and the products it indexes. Two of those products lean on a language model: the portfolio terminal at smngvlkz.com (https://smngvlkz.com), which you can ask questions, and PayChasers (https://paychasers.com), which generates optional payment follow-up emails. Both started on Google's Gemini 3 Flash. Both now run on a model I host myself, with a fallback chain that keeps them alive when my hardware is not.

This is the story of that move: the experiment that started it, why I committed to it, what the architecture looks like, the night it broke, and the parts I still have not solved.

When Qwen 3.5 was announced, it made me curious about how far open models have actually come. Instead of reading benchmarks, I tested it the way I like to learn things, by running it.

It began as a small experiment on my base Mac mini. I pulled Qwen through Ollama (https://ollama.com) just to see how capable the model would be running directly on a local machine. The results were far better than I expected. Good enough that I stopped thinking of it as a toy and started thinking about production.

Gemini 3 Flash worked. The integration was a few lines and the quality was good. So this was not a "the API is bad" story. It was three smaller pulls that added up.

The first was cost shape. PayChasers generates optional email drafts on demand, and every preview is a few thousand tokens of system prompt plus output.
That is fine at zero users and a slow leak at volume. The marginal cost of an inference I run on a machine I already own is electricity.

The second was control and privacy. I wanted to choose the model, pin it, and change the prompt contract without a provider deprecating something underneath me. I also did not love sending client names and payment context to a third party when I did not have to.

The third was the economics of treating AI as infrastructure rather than a metered API. Once the model runs on hardware I control, it stops being a per-call expense and becomes shared infrastructure that multiple applications can use. The same inference server now powers two different products. That reframing is the whole point.

The original plan was to host the model on Oracle Cloud using one of their free Ampere ARM instances in the Johannesburg region. If you have ever tried to get one, you know the struggle. Free-tier ARM capacity is brutally limited, and after more than 200 automated retry attempts across two days, I still could not get one.

So I pivoted. I wrote a lightweight reverse proxy, set up a Cloudflare Tunnel on one of my domains, and routed production traffic to the model running on my Mac at home. No ports opened on my home network, no static IP, just a tunnel from Cloudflare's edge to the machine on my desk. It was meant to be temporary.

The Oracle instance eventually did come through, but by then the home setup was working well, so I did not throw it away. Instead I kept the Mac mini as the primary and gave Oracle a different job: the always-on backup. More on that in a moment.

This was a small full-circle moment. The Linux and infrastructure fundamentals I picked up during my bootcamp days and years of self-teaching showed up in a real production context. Provisioning tunnels, configuring DNS, writing a proxy service, setting up persistent services: all of it came together for something real.
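For readers wondering what the "lightweight reverse proxy" mentioned above might involve, here is a minimal sketch and not the proxy actually running in this setup. It assumes Ollama is listening on its default local port, and it adds a hypothetical allow-list (`ALLOWED_PATHS`, `isAllowed`) so that only the two inference endpoints are reachable through the tunnel. The `OLLAMA_UPSTREAM` variable and the function names are illustrative assumptions.

```typescript
import * as http from "node:http";

// Assumption: Ollama listens on its default local address and port.
const UPSTREAM = new URL(process.env.OLLAMA_UPSTREAM ?? "http://127.0.0.1:11434");

// Hypothetical allow-list: expose only the inference endpoints the apps need.
const ALLOWED_PATHS = ["/api/generate", "/api/chat"];

// Returns true only for POST requests to an allow-listed path.
function isAllowed(method: string | undefined, url: string | undefined): boolean {
  if (method !== "POST" || !url) return false;
  const pathname = url.split("?")[0];
  return ALLOWED_PATHS.includes(pathname);
}

// Builds an HTTP server that streams allowed requests to the upstream model server.
function createProxy(upstream: URL = UPSTREAM): http.Server {
  return http.createServer((clientReq, clientRes) => {
    if (!isAllowed(clientReq.method, clientReq.url)) {
      clientRes.writeHead(404).end();
      return;
    }
    const upstreamReq = http.request(
      {
        hostname: upstream.hostname,
        port: upstream.port || 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: { ...clientReq.headers, host: upstream.host },
      },
      (upstreamRes) => {
        // Pass the upstream status, headers, and (possibly streamed) body through.
        clientRes.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
        upstreamRes.pipe(clientRes);
      }
    );
    // If the model server is unreachable, answer with 502 instead of hanging.
    upstreamReq.on("error", () => {
      if (!clientRes.headersSent) clientRes.writeHead(502);
      clientRes.end();
    });
    clientReq.pipe(upstreamReq);
  });
}
```

Starting it would be something like `createProxy().listen(8787, "127.0.0.1")`, where the port is an arbitrary choice, with the tunnel pointed at that local address. Binding to the loopback interface fits the "no ports opened" property described above, since only the tunnel process on the same machine can reach the proxy.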
One deliberate decision was to keep the infrastructure simple. There are a lot of frameworks and agent systems appearing in the space right now. I focused on straightforward tooling that solved the problems I actually had.

The Mac mini, exposed through a Cloudflare tunnel, is the primary. It is fast but it is not always on, because it is a machine in my home. The Oracle Cloud VM is the fallback. It is slower and smaller, but it stays up around the clock. Every app talks to a thin client that knows about both, tries the fast one first, and silently falls back to the reliable one.

    Vercel app
        |
        v
    primary: Mac mini via Cloudflare tunnel  --fail/timeout--  fallback: Oracle Cloud VM
    fast, not always on                                        slow, always on

This is the whole idea in one function. Hit the primary with a timeout. If anything goes wrong (a bad status, the timeout, a dropped tunnel), fall through to the fallback.

    const PRIMARY_URL = process.env.OLLAMA_PRIMARY_URL || "http://localhost:11434";
    const FALLBACK_URL = process.env.OLLAMA_FALLBACK_URL || PRIMARY_URL;

    async function fetchWithFallback(path: string, body: object): Promise