{"slug": "you-can-t-see-what-your-ai-actually-costs-so-i-built-the-meter-that-can", "title": "You Can't See What Your AI Actually Costs — So I Built the Meter That Can", "summary": "A developer built a governance plane for AI costs on a homelab, metering every LLM call to reveal that 85% of would-be costs were avoided through caching. The system provides visibility, defensible savings, and trust by tracking counterfactual baselines and self-testing math to prevent errors like the double-count bug that initially inflated savings by 2x.", "body_md": "Every team I talk to can tell me what their cloud bill was last month. Almost none can tell me what their AI calls cost — or, more importantly, what those calls *saved*. LLM spend gets filed under \"application cost,\" something the app team eyeballs once a quarter. That's the wrong mental model. Token spend is an **infrastructure cost**, and the moment you treat it like one — meter it, budget it, cache it, prove the savings — the economics change.\n\nSo I built a governance plane for the AI stack running on my homelab. Not a dashboard with a cost number on it. A system that answers three questions a finance partner would actually ask: *What did it cost? What would it have cost without our engineering? Can you prove that number is right?*\n\nThe answer to the third question turned out to be the hard part — and the most valuable.\n\n## The Core Fix\n\nTreat every LLM call the way a data center treats compute: consolidate repeated work, keep the cheap tier absorbing most of the traffic, and meter everything per consumer. The single biggest lever is **not sending the same work upstream twice**. When you measure that properly, you discover most of your savings already exist — you just couldn't see them.\n\nIn my case, once the meter was honest, it showed **85 percent of the would-be cost was being avoided**, almost entirely by caching the model never had to re-run. That's not a projection. 
It's a measured ratio between what the work *would* have cost at list price and what it actually cost.\n\n## What \"governance\" actually means here\n\nThree things, in business terms:\n\n**Visibility.** You cannot govern what you cannot measure. Every call is metered by who made it, which model answered, and whether it was served fresh or from cache — then rolled up into one view. Before this, \"AI cost\" was a vibe. Now it's a line item per consumer, per model, updated continuously.\n\n**Savings you can defend.** A cost number alone is useless for decision-making. The number that matters is the **counterfactual**: what this exact workload *would* have cost with none of the engineering — every token at full price, nothing served from cache. Savings is the gap between that baseline and reality. Putting both on the same chart turns \"we think caching helps\" into \"caching avoided 85 percent of a five-figure baseline, here's the curve.\"\n\n**Trust.** This is the part nobody talks about and everybody needs. A savings number that's wrong is worse than no number, because people make decisions on it.\n\n## The bug that proves the point\n\nEarly on, my system confidently reported a savings figure that was **roughly double the truth**.\n\nThe cause was mundane and exactly the kind of thing that ships to production every day: the usage logs replayed the same records in more than one place, and my first pass counted the replays as real spend. Nearly half the lines were duplicates. The dashboard looked great. It was also wrong by 2×.\n\nHere's the principal-engineer lesson, and it's free: **ratios survive, absolute numbers lie.** The efficiency *percentage* was correct the whole time, because the double-count inflated the baseline and the actual figure together — they scaled, the ratio held. 
But the headline dollar figure was fiction until I deduplicated the source.\n\nI only caught it because I went looking for it, and then I made sure I'd never have to rely on luck again. I wrapped the cost math in a **self-test**: a set of fixed inputs with known, hand-checked answers that runs in CI on every change. And a matching invariant check guards every single publish: if the numbers ever fail their own identity, the system refuses to write them rather than show a wrong one. The math is now gated like the code is gated. That's the difference between a metric and a number you can put in front of a finance partner.\n\n## Does the caching actually work? I measured it\n\nA claim like \"caching saves money\" is only honest if you've watched it happen. So I sent my system the same question twice, back to back, and timed it:\n\n- **First time** (a question it had never seen): ~50 seconds, full model call, full cost.\n- **Second time** (the identical question): **4 milliseconds, zero tokens, byte-for-byte the same answer.**\n\nThat's not a rounding improvement. It's the same work, served roughly thirteen thousand times faster for nothing, for as long as the answer stays fresh. For anything repeated (the same question asked by ten different people, a report regenerated after a hiccup, an assistant re-reading the same material), the second request onward is free.\n\nThe honest caveat, because the honest version is more credible: this particular layer matches *exact* repeats. A reworded version of the same question still pays full price once. Catching rephrasings is a harder, fuzzier problem; it's solvable, and it's built, but I keep it deliberately conservative. Which brings me to the part I'm not going to hand you.\n\n## What I'm not publishing, and why that's the point\n\nThere's a real line between the **principles**, which are free, and the **implementation**, which is the leverage. 
This post is all principles:\n\n- Meter per consumer; treat spend as infrastructure.\n- Measure the counterfactual, not just the cost.\n- Let the cheapest tier absorb the most traffic.\n- One canonical price list, never two — divergence is invisible until it bites.\n- Gate the math the way you gate the code.\n\nThose are worth more than gold to anyone running LLMs at scale, and I'm giving them away on purpose. What I'm *not* publishing is how my retrieval, routing, and caching are actually wired — the specific shapes that make most of the bill disappear instead of a sliver of it. The principles tell you *what* to build; closing the distance to that number is engineering, and that engineering is the moat.\n\n## The business case, plainly\n\nIf you're running LLMs through a flat subscription, these numbers are notional — a value signal, not a bill. But flip the lens: **if you were paying metered API rates, an 85 percent efficiency ratio is your invoice cut by that much, with the quality unchanged** — because the savings come from not re-doing work, not from downgrading the model. Every novel, hard question still goes to the best model at full quality and full price; only the repeats are served free. 
And a quality bar guards what gets cached in the first place: cost reduction that degrades the product isn't a saving, it's a regression with good PR.\n\nThe shape of the ROI is the part that travels to any organization:\n\n| What it buys | Business value |\n|---|---|\n| Per-consumer metering | A real line item instead of a quarterly guess |\n| Counterfactual savings | \"We avoided 85 percent\" you can defend in a budget review |\n| Exact-repeat caching | Repeated work served free and instant (roughly 50 seconds → 4 milliseconds) |\n| Single canonical price list | No silent drift between what you charge and what you pay |\n| Self-tested math + alerting | Numbers a finance partner can trust; degradation pages you, it doesn't hide |\n\nI built this on a small three-node cluster in my house — a Raspberry Pi and two PCs — for the cost of my own time. The point was never the hardware; the governance layer is light enough to run almost anywhere. It was proving that **AI spend is governable infrastructure** — and that the difference between a team that knows its AI economics and one that guesses is a few well-placed gates and one honest counterfactual.\n\nThe 85 percent was always there. 
Most teams just never built the meter that could see it.", "url": "https://wpnews.pro/news/you-can-t-see-what-your-ai-actually-costs-so-i-built-the-meter-that-can", "canonical_source": "https://blog.r-lopes.com/posts/governing-ai-token-spend", "published_at": "2026-06-13 14:00:00+00:00", "updated_at": "2026-06-14 02:05:18.775056+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-infrastructure", "ai-tools", "ai-ethics"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/you-can-t-see-what-your-ai-actually-costs-so-i-built-the-meter-that-can", "markdown": "https://wpnews.pro/news/you-can-t-see-what-your-ai-actually-costs-so-i-built-the-meter-that-can.md", "text": "https://wpnews.pro/news/you-can-t-see-what-your-ai-actually-costs-so-i-built-the-meter-that-can.txt", "jsonld": "https://wpnews.pro/news/you-can-t-see-what-your-ai-actually-costs-so-i-built-the-meter-that-can.jsonld"}}