We Had 6 Features. 2 Were Eating Our Budget

wpnews.pro

Twelve months ago we were burning $4,200/month on AI infrastructure and could not tell you which feature was responsible for a single dollar of it.

Three engineers. Two monitoring tools. One Datadog dashboard. All of them answered the same question: how much.

Nobody was answering which feature. Which user. Which service.

Those are the only three numbers that matter when the bill starts climbing.

Our B2B SaaS platform handles enterprise document workflows. Six features hit GPT-4o regularly:

Contract Analyzer extracts clauses and flags risk on upload.

Executive Summary Generator runs on demand when a user asks for it.

Smart Search does semantic search across the document corpus.

Compliance Checker runs on every document save.

Inline Redline Suggester triggers on text selection.

Audit Trail Narrator generates human readable audit logs nightly.

Gut ranking for cost: Contract Analyzer first, Smart Search second, everything else minor.

Gut ranking for revenue impact: Contract Analyzer and Executive Summary driving most upgrade decisions.

We were wrong on both. Not slightly wrong. Completely wrong.

We instrumented every LLM call with feature level, service level, and user level tags. Forty eight hours of data:

Feature                      Monthly Cost    % of Total    Avg Cost/Call
-------------------------------------------------------------------------
Compliance Checker           $1,890          45%           $0.087
Audit Trail Narrator         $1,102          26%           $0.240
Contract Analyzer            $672            16%           $0.310
Executive Summary Generator  $294            7%            $0.180
Smart Search                 $168            4%            $0.021
Inline Redline Suggester     $74             2%            $0.009

Contract Analyzer, the one we had spent two weeks optimising, was third.

Two features we had never once discussed in a cost review were consuming 71% of the budget.

Compliance Checker ran on every document save.

Autosave interval: 30 seconds.

40 active enterprise users. That is 4,800 GPT-4o calls per hour. Every working hour. Every working day. Silently. Perfectly. Expensively.

No errors. No timeouts. No failed requests. No alert ever fired. Every log looked clean.

The feature was not broken. The design was. Nothing in our stack could tell the difference between working correctly and working expensively because those look identical at the response level.

Fix: Compliance check moved to manual trigger and document submission only.

Result: $1,890 to $190/month. One line of code.

The Audit Trail Narrator ran nightly. Reasonable on paper.

But it was generating a full GPT-4o narrative for every document that had any activity, including documents touched by automated processes, system integrations, and background jobs.

Roughly 60% of those documents had zero human readers of the audit logs. We were generating prose for an audience that did not exist. Every single night.

Fix: Scoped to human triggered activity only. Minimum three human edits before narration runs.

Result: $1,102 to $310/month.

Combined recovery: $2,592/month. No feature cut. No model downgrade. No user impact.

Both bugs had an error rate of zero. That is what makes them dangerous. Silent. Working. Expensive.

Once per feature attribution was running, we rolled it up into cost per user by plan tier.

Plan          Avg Cost to Serve/Month    MRR per Seat    Margin
----------------------------------------------------------------
Starter       $8.40                      $149            94%   ✓
Growth        $67.00                     $149            55%   ✓
Enterprise    $198.00                    $149            -33%  ✗

Our Enterprise customers were costing us $49 per seat per month more than they were paying.

They were our most active users, heavy on Compliance Checker and Audit Trail Narrator. Flat pricing made the loss invisible. Per user attribution made it impossible to ignore.

We repriced Enterprise to usage based. That conversation with customers was not difficult because the data was exact. Per user, per feature, per month numbers. No estimates. No gut feel. Nothing to argue with.

Beyond feature and user, attribution broke down cost by service, specifically which microservice was originating each LLM call.

Our document-processing-service was making compliance calls that our compliance-service was also making downstream on the same document. We were double billing ourselves on overlapping prompts. Completely invisible until you track at the service layer.

Another $180/month recovered from a bug with zero user facing symptoms.

Three dimensions. Three different bugs. Any single dimension alone would have missed two of them.

We had Datadog. We had CloudZero.

CloudZero is built for cloud infrastructure cost allocation. AWS, GCP, Azure broken down by team and resource tag. It does that well. But it has no native AI or LLM tracking. OpenAI, Anthropic, Cohere are not in its data model. You end up back on manual tagging workarounds and provider dashboards that only show totals.

The gap is not about dashboards or visualisations. It is about where in the stack the data gets captured. We needed instrumentation sitting between our application code and the provider API, tagging every call at the moment it happens with which feature triggered it, which user caused it, and which service originated it. Then surfacing cost per feature, cost per user, and cost per service in one place, with budget alerts that fire before the bill arrives, not after.

That is the layer standard monitoring tools do not reach.

Can you answer this in under 60 seconds:

Which feature is your most expensive to run?

Which users are most expensive to serve?

Which service is generating the most LLM spend?

Is any of that negative for your unit economics at current pricing?

If you would have to dig for any of those answers, the risk is not in your monitoring. It is in the layer your monitoring does not reach.

Total spend is a receipt. Attribution is a map.

We had the receipt for fourteen months before we got the map.

We use CostReveal for this. The SDK wraps your existing provider calls and tags every call by feature, service, and user. One dashboard surfaces all three dimensions with real time budget alerts and hard limits that stop runaway spend before calls complete. Took one evening to instrument. Both bugs above showed up within 72 hours.

Docs at docs.costreveal.com if you want to go straight to setup.

Have you ever found a feature quietly bleeding budget with a perfect error rate? Or discovered a plan tier you were losing money on? Drop it in the comments. Curious how common this pattern actually is.

source & further reading

dev.to — original article 🌍 Exposing Your Hermes Agent to the Internet with Tailscale Funnel (Safely) Why Your Browser Should Do the Heavy Lifting: A Guide to Local Data Sanitization Jarvis AI Platform: Implementing Semantic Memory Retrieval with pgvector

We Had 6 Features. 2 Were Eating Our Budget

Run your AI side-project on zahid.host