# How I Cut My LLM Costs by 90% Without Changing My App Logic

> Source: <https://dev.to/mervindublin/how-i-cut-my-llm-costs-by-90-without-changing-my-app-logic-278f>
> Published: 2026-05-21 20:44:33+00:00

How I Cut My LLM Costs by 90% Without Changing My App Logic
There’s a particular kind of dread that comes with checking your OpenAI billing dashboard mid-month.
I’ve been building a news automation hub that runs 14 editorial workspaces — summarizing, rewriting, fact-checking, SEO-tagging, and translation pipelines around the clock.
The AI layer was already fairly optimized:
- Groq
- Gemini Flash
- DeepSeek
- OpenRouter
- provider rotation
- fallback logic
But the final fallback was still OpenAI, and once rate limits hit, costs climbed faster than expected.
What I needed wasn’t more routing logic.
I needed a smarter endpoint.
The Problem
My setup already rotated between multiple providers, but the architecture had a weakness:

``` php
Provider exhausted
    -> fallback
        -> OpenAI
            -> credits disappear
```

The more providers I added, the messier things became:
- more API keys
- more retry logic
- more conditional branches
- more provider-specific handling
I was optimizing infrastructure with application code.
That was the mistake.
The Fix
After digging through self-hosted AI tooling, I found freellmapi
.
It’s a lightweight OpenAI-compatible proxy that automatically routes requests across multiple free-tier LLM providers:
- Groq
- Cerebras
- SambaNova
- Cloudflare Workers AI
- GitHub Models
- OpenRouter free models
- and others
Combined free-tier capacity: roughly 800M tokens/month.
The interesting part is that the routing happens inside the proxy — not inside your app.
My Integration
The integration took less than an hour.
1. Deploy the proxy
I ran it on my existing VPS:
- Node.js 20
- ~40MB idle RAM
- localhost only
2. Add provider credentials
I added:
- Groq key
- Cloudflare credentials
- OpenRouter key
inside the admin panel.
3. Point my app to a single endpoint

``` js
const client = new OpenAI({
  baseURL: "http://localhost:3001/v1",
  apiKey: process.env.LOCAL_ROUTER_KEY
});
```

That was basically it.
The important detail:
I stopped specifying models for non-critical tasks.
Instead of forcing a specific provider, I let the proxy auto-route requests to whatever free provider was currently available.

``` php
App
  -> freellmapi
      -> Groq
      -> Cloudflare Workers AI
      -> Cerebras
      -> SambaNova
      -> OpenRouter
```

If Groq rate-limited:
- another provider picked up the request
If a provider became slow:
- routing shifted automatically
My application code never needed to know.
The Result
Within 24 hours:
- OpenAI usage dropped by ~90%
- background AI tasks became almost entirely free-tier
- no additional retry logic was needed
Most importantly:
I removed provider chaos from my application layer.
What I Learned
When engineers hit rate limits, the instinct is usually:
- add more providers
- add more fallback logic
- add more code
But sometimes the better solution is adding an abstraction layer that absorbs the complexity for you.
Another realization:
Most AI tasks do not require a specific premium model.
For:
- summaries
- tagging
- drafts
- translations
- background enrichment
…almost any decent modern 70B model works fine.
Caveats
Free-tier infrastructure has tradeoffs.
Some providers:
- have cold starts
- introduce latency spikes
- become temporarily unavailable
For real-time user-facing chat systems, you should test failover carefully.
For async pipelines and batch jobs, though, it’s been surprisingly solid.
Also:
run this on infrastructure you control.
A proxy like this handles upstream API keys — don’t hand that responsibility to random hosted services.
Final Thought
The biggest optimization wasn’t changing models.
It was removing complexity from the layer that had to manage them.
