Our LLM bill went from $620 to $2,480 a month in 23 days.
No new features shipped. No traffic spike. Zero error alerts. Deployment logs were clean. Five engineers staring at dashboards that gave us totals and nothing else.
What we had was a receipt. What we needed was a map.
Here are five things hiding inside your LLM bill right now that your monitoring stack almost certainly cannot show you.
Every provider dashboard shows you model-level totals. GPT-4o: $X. Claude: $Y.
That number is useless for debugging.
What you need is feature-level attribution: which product feature triggered each call. In our case the batch report generator was responsible for 74% of total spend. We had been optimising the other two features for two straight weeks because they felt expensive.
Here is what 48 hours of real attribution data looked like:
| Feature | Monthly Cost | Share |
|---|---|---|
| Batch Report Generator | $1,847 | 74% |
| Document Summariser | $421 | 17% |
| Inline Suggestion Engine | $212 | 9% |
We had been optimising the wrong two features the entire time.
What to do: Instrument every LLM call with a feature tag at the point of the call. Not in post-processing. Not in a weekly report. At the call itself. The data only means something if it captures what triggered the request.
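A minimal sketch of tagging at the call itself, assuming a provider client that reports token counts. The names `tagged_call`, `LEDGER`, and `fake_send`, the model name, and the prices are all hypothetical placeholders, not anything from our actual stack:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical per-1K-token prices in USD; substitute your provider's real rates.
PRICE_PER_1K = {"example-model": {"input": 0.0025, "output": 0.01}}


@dataclass
class CallRecord:
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


LEDGER: list[CallRecord] = []


def tagged_call(feature, model, prompt, send):
    """Call the provider via `send` and record the cost under `feature`.

    `send` is any callable returning (text, input_tokens, output_tokens).
    """
    text, tokens_in, tokens_out = send(model, prompt)
    price = PRICE_PER_1K[model]
    cost = tokens_in / 1000 * price["input"] + tokens_out / 1000 * price["output"]
    LEDGER.append(CallRecord(feature, model, tokens_in, tokens_out, cost))
    return text


def cost_by_feature(ledger):
    """Sum recorded cost per feature tag."""
    totals = defaultdict(float)
    for record in ledger:
        totals[record.feature] += record.cost_usd
    return dict(totals)


def fake_send(model, prompt):
    # Stand-in for a real provider client, so the sketch runs offline.
    return "ok", 1000, 500


tagged_call("batch_report_generator", "example-model", "Summarise Q3", fake_send)
tagged_call("inline_suggestions", "example-model", "Complete this", fake_send)
print(cost_by_feature(LEDGER))
```

The point of the design is that the feature name is a required argument: a call cannot reach the provider without saying which feature asked for it.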
This one does not feel like a cost problem at first. It feels like a pricing problem later.
Once we had feature-level attribution running, we rolled it up per user and per plan tier. What came back changed how we run the business:
| Plan | Avg Cost to Serve / Month | MRR per Seat | Margin |
|---|---|---|---|
| Starter | $3.20 | $49 | 93% ✓ |
| Growth | $31.00 | $49 | 37% ✓ |
| Enterprise | $89.00 | $49 | -82% ✗ |
Our most active users were our most unprofitable users.
Flat pricing made this invisible for 14 months. Per-user attribution made it impossible to ignore in 48 hours.
We repriced Enterprise to usage-based pricing. That conversation with customers was not difficult because the numbers were exact. Per user. Per feature. Per month. Nothing to argue with.
What to do: Roll up cost per user once you have feature attribution running. The unit economics gap only becomes visible at that layer. If you are on flat pricing and your power users are also your heaviest LLM users, there is a real chance you are losing money on your best customers right now.
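A rough sketch of that roll-up, assuming each attribution record carries a user ID, a plan, and a cost. The sample records are invented for illustration; the $49 seat price is the flat price from the table above, and margin is measured against revenue:

```python
from collections import defaultdict

# Hypothetical attribution records: (user_id, plan, cost_usd) per LLM call.
calls = [
    ("u1", "Starter", 1.20), ("u1", "Starter", 2.00),
    ("u2", "Enterprise", 50.00), ("u2", "Enterprise", 39.00),
]
MRR_PER_SEAT = 49.00  # flat price per seat


def cost_per_user(records):
    """Total LLM cost per (user, plan) pair."""
    totals = defaultdict(float)
    for user_id, plan, cost in records:
        totals[(user_id, plan)] += cost
    return totals


def margin_pct(revenue, cost):
    """Gross margin as a percentage of revenue."""
    return (revenue - cost) / revenue * 100


for (user_id, plan), cost in cost_per_user(calls).items():
    margin = margin_pct(MRR_PER_SEAT, cost)
    print(f"{user_id} {plan}: ${cost:.2f} cost, {margin:.0f}% margin")
```

Any user whose margin comes out negative is costing more to serve than their seat brings in.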
This one is invisible until you track at the service layer.
Our document-processing-service was making compliance calls. Our compliance-service was also making compliance calls downstream on the same document. We were paying twice for the same prompt on the same input. Every single time.
Zero user-facing symptoms. Zero errors. Zero alerts. $180 a month just gone.
Three dimensions matter: feature, user, service. Any single dimension on its own misses the bugs that only show up in the other two. We had one dimension for 14 months and thought we had visibility.
What to do: Tag every call with the originating service name alongside the feature and user. When you break cost down by service you will find overlapping calls that look completely normal in isolation but are duplicates at the system level.
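One way to spot this kind of overlap, sketched under the assumption that each logged call stores the service name, the prompt, and an identifier for the input document. The record shape and the function names are hypothetical:

```python
import hashlib
from collections import defaultdict


def fingerprint(prompt, document_id):
    """Stable key for 'same prompt on the same input'."""
    raw = f"{prompt}\x00{document_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def cross_service_duplicates(records):
    """Return fingerprints that were paid for by more than one service."""
    services = defaultdict(set)
    for rec in records:
        key = fingerprint(rec["prompt"], rec["document_id"])
        services[key].add(rec["service"])
    return {fp: sorted(svcs) for fp, svcs in services.items() if len(svcs) > 1}


records = [
    {"service": "document-processing-service", "prompt": "Check compliance", "document_id": "doc-1"},
    {"service": "compliance-service", "prompt": "Check compliance", "document_id": "doc-1"},
    {"service": "compliance-service", "prompt": "Check compliance", "document_id": "doc-2"},
]
dupes = cross_service_duplicates(records)
for services in dupes.values():
    print("paid for twice by:", services)
```

Each call looks normal inside its own service. The duplicate only appears once the same fingerprint is grouped across services.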
This is the most dangerous category on this list.
A feature that errors gets flagged. A feature that succeeds too often, on a broken trigger, gets nothing.
Our compliance checker ran on every document save. Autosave interval: 30 seconds. 40 enterprise users. That is 4,800 GPT-4o calls per hour. Every working hour. Every working day.
No alert ever fired because nothing was wrong at the response level. Every call succeeded. Every log looked clean. The bug was in the trigger design, not the call itself.
Fix: moved compliance check to manual trigger and document submission only.
Result: $1,890 to $190 per month. One line of code. No feature removed. No model downgraded. Zero user impact.
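The shape of that change, as a hedged sketch rather than our actual code. The event names are hypothetical; all that matters is that autosave is no longer in the set of events that trigger the check:

```python
# Hypothetical event names. The only fact carried over from the fix is that
# the check now runs on manual trigger and document submission, not autosave.
COMPLIANCE_TRIGGERS = {"manual_check", "document_submit"}


def should_run_compliance(event_type):
    """Gate the expensive LLM call on the event that caused the save."""
    return event_type in COMPLIANCE_TRIGGERS


# One user-hour of editing: 120 autosaves (one every 30 seconds), then a submit.
events = ["autosave"] * 120 + ["document_submit"]
calls_made = sum(should_run_compliance(e) for e in events)
print(calls_made)  # 1
```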
What to do: Look at call frequency per feature, not just cost per call. A feature that runs 2,000 times a day with a $0.09 average call cost is a $5,400 a month feature. That number only appears when you are rolling up cost by feature over time, not inspecting individual requests.
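The arithmetic behind both numbers, as a small sketch. The function names are made up for illustration, and the monthly figure assumes a 30-day month:

```python
def calls_per_hour(interval_seconds, active_users):
    """Calls triggered per hour when every user fires one call per interval."""
    return (3600 // interval_seconds) * active_users


def monthly_cost(calls_per_day, avg_cost_per_call, days=30):
    """Projected monthly spend for a feature, assuming a constant daily rate."""
    return calls_per_day * avg_cost_per_call * days


# Autosave every 30 seconds across 40 users.
print(calls_per_hour(30, 40))  # 4800

# 2,000 calls a day at $0.09 each.
print(f"${monthly_cost(2000, 0.09):,.0f} per month")
```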
This one took us the longest to understand.
We had Datadog. We had the OpenAI usage dashboard. We had CloudWatch. All of them answered one question: how much.
None of them answered which feature, which user, or which service.
Those are completely different questions. Infrastructure monitoring watches infrastructure. It knows a request succeeded. It has no concept of which product feature triggered it, which customer caused it, or whether that success was profitable given your pricing.
The gap is not about dashboards or visualisations. It is about where in the stack the data gets captured. You need instrumentation sitting between your application code and the provider API, tagging every call at the moment it happens with what triggered it.
Standard monitoring tools do not reach that layer. That is not a criticism of those tools. They were not built for it. But if you are running LLM features in production and relying only on infrastructure monitoring, you have blind spots that look exactly like a system working correctly.
What to do: Ask yourself one question. Can you answer this in under 60 seconds:
Which feature is your most expensive to run, for which users, and is that number healthy for your unit economics at your current pricing?
If you would have to dig for any part of that answer, the risk is not in your monitoring. It is in the layer your monitoring does not reach.
After 23 days of climbing bills and wrong guesses, a teammate dropped CostReveal in our Slack. The SDK wraps your existing provider calls and tags every call by feature, service, and user. One dashboard surfaces all three dimensions, with real-time budget alerts that fire before the bill arrives.
Setup took one evening. Real data showed up in 48 hours. Both the autosave bug and the double-calling service bug surfaced within 72 hours of instrumentation.
Docs at docs.costreveal.com if you want to go straight to setup.
Total spend is a receipt. Attribution is a map.
We had the receipt for 14 months before we got the map.
Have you found a silent cost bug like this? A feature working perfectly and quietly draining budget with zero alerts? Drop it in the comments. Genuinely curious how common this pattern is.