We Get AI Costs Under Control

An AI consultant revealed that a client spent nearly half a billion dollars in a single month due to unset usage limits on AI licenses, highlighting the need for FinOps principles adapted to AI. The article argues that token-based costs require transparency, attribution, and guardrails to prevent runaway spending, especially with agentic workflows.

Why tokens are becoming the new unit of cost, and how transparency, clear limits, and empowered teams keep it under control

One example caused a stir across the industry: an AI consultant told the news outlet Axios that one of their clients burned through roughly half a billion dollars in a single month because no usage limits had been set on the employees’ AI licenses. The case sounds extreme, and at that scale it is the exception. The underlying pattern is not. On a smaller scale, many organizations are experiencing it right now: a bill jumps from a few hundred to several thousand euros a month, without any alarm going off in the system or any way to tell which service or which user caused the increase.

Anyone familiar with FinOps from the cloud world recognizes the problem immediately. It is not the individual expensive token that drives the cost, but the lack of visibility and the absence of clear boundaries. This is exactly where FinOps comes in: it connects financial governance with operational transparency, so that business units, IT, and finance can make decisions on a shared data basis. The same logic applies to AI. Only the unit of cost has changed.

The good news up front: there is now an open standard for this transparency. Just as logs, metrics, and traces became established in classic monitoring, the OpenTelemetry GenAI Semantic Conventions provide a uniform vocabulary to capture AI consumption in a vendor-neutral way and attribute it to individual sources. This standard is the common thread running through the sections that follow, from attribution through architecture to ongoing controlling.
In short, FinOps for AI rests on five levers: first, transparency through token attribution, meaning the assignment of every model call to user, team, and feature; second, an AI proxy as a central control point for all AI traffic; third, clear limits and guardrails that prevent uncontrolled costs; fourth, continuous optimization through the right model choice, lean calls, and caching; and fifth, empowered teams that understand and own their costs. The following sections walk through these levers one by one.

Classic software is mostly billed per license or per seat. Costs are predictable and rarely change between two billing periods, and procurement runs through a clearly defined purchasing process. AI behaves differently. Here the unit of cost is the token, and costs arise anew with every single call, depending on the length of the input, the length of the response, and the chosen model.

A parallel to the cloud is decisive here, and many underestimate it: the purchasing decision is democratized within the organization. Just as the cloud shifted the procurement of infrastructure out of purchasing and into the hands of engineering, AI distributes spending authority even more finely. A central body no longer decides on costs; instead, every developer triggers real spending with every prompt, every model choice, and every agent they start. The frequency at which cost-relevant decisions are made is therefore orders of magnitude higher than with classic software.

The effect is especially pronounced with agentic workflows, that is, AI systems that handle multi-step tasks autonomously. Such processes consume many times the tokens of a single model call because they work in loops, repeatedly carry context along, and generate intermediate steps. A single careless loop or an unbounded background job can thus cause significant costs in a short time. Whereas a cloud misconfiguration often takes days to show its effect, an agent running out of control can become costly within minutes.
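To make this arithmetic concrete, here is a minimal sketch that compares a single call with an agent loop that resends its growing context on every step. The prices, token counts, and function names are assumptions chosen purely for illustration, not real list prices.

```python
# Hypothetical prices per million tokens (assumed for illustration only).
PRICE_PER_MILLION = {"input": 3.00, "output": 15.00}


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call: input and output tokens are priced separately."""
    return (
        input_tokens * PRICE_PER_MILLION["input"]
        + output_tokens * PRICE_PER_MILLION["output"]
    ) / 1_000_000


def agent_loop_cost(base_context: int, tokens_per_step: int, steps: int) -> float:
    """Cost of an agent that resends its whole, growing context on every step."""
    total = 0.0
    context = base_context
    for _ in range(steps):
        total += call_cost(context, tokens_per_step)
        context += tokens_per_step  # each step's output becomes input of the next
    return total


single = call_cost(2_000, 500)
loop = agent_loop_cost(base_context=2_000, tokens_per_step=500, steps=40)
print(f"single call:        {single:.4f}")
print(f"40-step agent loop: {loop:.4f} ({loop / single:.0f}x a single call)")
```

Under these assumptions, the loop costs far more than forty single calls, because the input side grows with every step that carries the previous results along.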
This shifts the central question. It is no longer whether AI should be used, but how its consumption can be made visible, attributed to individual sources, and limited when necessary.

The most important idea for decision-makers is this: AI cost control is not an entirely new problem. It is FinOps with a finer granularity and a much higher velocity. Anyone who has already established a tagging strategy in the cloud already holds the decisive mindset. It only needs to be transferred to the new unit of cost.

In cloud FinOps, we attribute every resource via tags such as cost center, team, environment, or project. AI needs exactly the same discipline, only now the tags hang on every model call: user, team, feature, or workflow. Without this attribution at the moment of the call, the provider’s later aggregated bill can no longer be broken down. Anomalies cannot be explained, and the business value of an individual feature cannot be calculated.

Three parallels are particularly instructive here. First, the biggest practical problem in both worlds is coverage. In every cloud project, significant portions of resources are initially untagged and end up in an unallocated bucket. With AI the same question arises: does every single call really carry an identity? Second, in both cases the lever lies in enforcement at the source. In the cloud this means an untagged resource violates a policy. With AI it means a call without attribution is not let through in the first place. Third, the real strategic advantage emerges when the same taxonomy runs across both worlds. Then, for the first time, it becomes possible to answer what a feature costs in total, infrastructure and AI together. This is exactly the common denominator pursued by the FinOps Open Cost and Usage Specification, FOCUS for short, which increasingly brings AI consumption data into a uniform format and thus forms the natural connection to existing tools such as Apptio.
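The first two parallels, coverage and enforcement at the source, can be sketched in a few lines. The tag names, the exception class, and the sample calls below are assumptions for illustration; a real gateway would apply the same check to its own taxonomy.

```python
REQUIRED_TAGS = ("user", "team", "feature")  # assumed minimal taxonomy


class MissingAttributionError(Exception):
    """Raised when a call lacks a required attribution tag."""


def enforce_attribution(tags: dict) -> dict:
    """Enforcement at the source: reject a call with missing or empty tags."""
    missing = [name for name in REQUIRED_TAGS if not tags.get(name)]
    if missing:
        raise MissingAttributionError(f"missing tags: {', '.join(missing)}")
    return tags


def coverage(calls: list) -> float:
    """Coverage metric: share of calls that carry a complete identity."""
    if not calls:
        return 1.0
    tagged = sum(
        1 for tags in calls if all(tags.get(name) for name in REQUIRED_TAGS)
    )
    return tagged / len(calls)


calls = [
    {"user": "k.herings", "team": "customer-support", "feature": "support-rag"},
    {"user": "a.example", "team": "customer-support"},  # feature tag missing
]
print(f"coverage: {coverage(calls):.0%}")
try:
    enforce_attribution(calls[1])
except MissingAttributionError as err:
    print(f"rejected: {err}")
```

Measuring coverage first and switching to hard rejection later is one plausible rollout order, since it shows how much traffic would be blocked before the policy takes effect.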
How these building blocks fit together is shown by the following architecture overview, before we look at the individual parts in more detail.

Figure: Building-block view of the AI proxy, with consumers on the left, the proxy in the center with its modules for identity, usage metering, limits, routing, and telemetry export, the providers including on-prem on the right, and Dash0 below as the analysis system

Before you can optimize, you have to measure. As mentioned at the outset, an open standard has established itself for this with the OpenTelemetry GenAI Semantic Conventions. Concretely, they define a uniform vocabulary for AI telemetry, for example for the model used, the provider, the type of operation, and the consumption of input and output tokens. The advantage for organizations is considerable. Once you instrument against this standard, you are not tied to a single provider but can freely send the data wherever it is to be analyzed.

To attribute costs to individual users, your own business attributes are added to these standardized fields, that is, user identifier, team, feature, or cost center. It is on exactly these attributes that aggregation later happens. It is the same idea as the trace ID or correlation ID in classic logging, with which an entry can be unambiguously assigned to a request or business process. Only here the identifier does not serve troubleshooting but economic attribution.
In practice, this looks like wrapping every model call in a span that carries the standardized fields and your own attribution attributes:

```python
from opentelemetry import trace

tracer = trace.get_tracer("ai-proxy")

# `client` is assumed to be an already configured provider SDK client.
with tracer.start_as_current_span("chat") as span:
    # Standardized GenAI attributes (OpenTelemetry Semantic Conventions)
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.provider.name", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-sonnet-4-6")

    # Your own attribution attributes: the basis of every cost breakdown
    span.set_attribute("enduser.id", "k.herings")
    span.set_attribute("team.name", "customer-support")
    span.set_attribute("feature.name", "support-rag")
    span.set_attribute("cost_center", "CC-4711")

    response = client.messages.create(...)

    # Record consumption after the call
    usage = response.usage
    span.set_attribute("gen_ai.usage.input_tokens", usage.input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", usage.output_tokens)
```

The gen_ai.* fields follow the open standard and are therefore identical across any compatible analysis system. Fields such as enduser.id or team.name are the business-level addition along which the bill can later be broken down by user, team, or feature. What matters is that this attribution is set at the time of the call, because it cannot be reconstructed afterward from the provider’s aggregated bill.

An important note from a data protection perspective: pure cost monitoring does not require storing prompt content. The metadata is sufficient, that is, token counts, costs, model, latency, and the attribution attributes. Especially for organizations with high requirements for data protection and data residency, this separation is decisive.

The most practical architecture for bringing visibility and control together is a central passage point for all AI traffic, often referred to as an AI gateway or AI proxy.
Instead of each application talking directly to the providers, the traffic runs through this single point. Applications and tools such as development environments, chat interfaces, agentic pipelines, or in-house services do not receive real provider keys, but virtual keys that are mapped internally to user, team, or cost center. Every call is thus automatically attributed, without the individual developer having to do anything extra. The building-block view shown earlier makes clear how measurement and enforcement come together in this one component.

The decisive point is that measurement and enforcement sit in the same building block. The proxy captures consumption, model, cost, and latency, enforces limits, can route simple tasks to smaller models, and exports the telemetry to an analysis system following the open standard. With this, transparency is no longer a downstream analysis at the end of the month but part of every single call. A welcome side effect is that this path also brings previously uncontrolled direct usage, so-called shadow AI, into a governed environment.

Figure: Health of the AI proxy in Dash0, showing throughput, latency, and error rate per service as the basis for anomaly detection

Visibility alone does not prevent a cost explosion. It is the prerequisite for being able to define sensible boundaries in the first place. In practice, a clear maturity pattern has emerged. Teams first introduce limits on the number of requests, add limits on token consumption after the first surprise bill, and introduce a hard budget limit per period and team after the second.

These limits work together on several levels. Request-count limits protect the infrastructure. Token-based limits steer the actual consumption, since tokens correlate directly with compute effort and cost. Budget limits finally prevent unexpected load spikes from batch processing or agent loops from leading to untenable bills.
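How the three levels interact can be sketched as a single admission check that a proxy runs before forwarding a call. The class names, field names, and numbers are assumptions for illustration, and the sketch deliberately leaves out how the per-minute counters are reset.

```python
from dataclasses import dataclass


@dataclass
class KeyLimits:
    requests_per_minute: int
    tokens_per_minute: int
    monthly_budget_eur: float


@dataclass
class KeyUsage:
    requests_this_minute: int = 0
    tokens_this_minute: int = 0
    spend_this_month_eur: float = 0.0


def admit(limits: KeyLimits, usage: KeyUsage, estimated_tokens: int) -> tuple[bool, str]:
    """Check the three limit levels in order and name the one that blocks."""
    if usage.requests_this_minute + 1 > limits.requests_per_minute:
        return False, "request limit"  # protects the infrastructure
    if usage.tokens_this_minute + estimated_tokens > limits.tokens_per_minute:
        return False, "token limit"  # steers the actual consumption
    if usage.spend_this_month_eur >= limits.monthly_budget_eur:
        return False, "budget limit"  # caps the bill for the period
    return True, "ok"


limits = KeyLimits(requests_per_minute=120, tokens_per_minute=200_000,
                   monthly_budget_eur=6_000.0)

normal = admit(limits, KeyUsage(10, 50_000, 1_200.0), estimated_tokens=4_000)
token_spike = admit(limits, KeyUsage(tokens_this_minute=199_000), estimated_tokens=4_000)
over_budget = admit(limits, KeyUsage(spend_this_month_eur=6_000.0), estimated_tokens=4_000)
print(normal, token_spike, over_budget)
```

Returning the name of the blocking level alongside the decision keeps rejections explainable, which matters once teams start asking why their calls were refused.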
At the gateway, this can be declared per virtual key:

```yaml
# Limits per virtual key (example: Customer Support team)
key: support-rag
limits:
  requests_per_minute: 120        # protects the infrastructure
  tokens_per_minute: 200000       # steers the actual consumption
budget:
  period: monthly
  soft_limit_eur: 5000            # soft threshold → alert
  hard_limit_eur: 6000            # hard limit → calls are rejected
circuit_breaker:
  cost_velocity_eur_per_min: 20   # stops runaway agents within minutes
```

Particularly effective is the combination of soft thresholds that warn early and a hard safety mechanism that intervenes when cost velocity becomes conspicuous. An agent running out of control that consumes a multiple of its planned budget per minute is thus stopped within minutes, rather than being discovered hours later by the finance department. This is exactly the mechanism that was missing in the case described at the start.

Once transparency and guardrails are in place, the actual optimization begins. The consistently largest lever is the right model choice. Not every task needs the largest and most expensive model. In the example shown above, the largest model class causes a cost share of 44 percent, although it accounts for only 18 percent of the calls. If part of these calls is routed to a mid-tier model, the average price drops considerably without noticeably affecting quality. In practice, significant portions of the cost can be saved this way. How large the difference between the classes is becomes clear from a look at the list prices and the cost of a single example request:

Figure: Cost per model class, showing list prices per million tokens and the cost of an example request with 10,000 input and 2,000 output tokens as of June 2026

The numbers make the lever tangible. The same example request costs around six cents on a mid-tier model and about two cents on a small model, and more than either on the largest model.
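The arithmetic behind such a comparison can be sketched as follows. The prices are assumptions: the mid-tier and small classes are chosen so that they reproduce the roughly six and two cents mentioned above, while the premium price is purely hypothetical. The traffic mixes are likewise invented for illustration.

```python
# Hypothetical list prices per million tokens as (input, output); assumed values.
PRICES = {
    "premium": (15.00, 75.00),
    "mid": (3.00, 15.00),
    "small": (1.00, 5.00),
}


def request_cost(model_class: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in currency units for the given model class."""
    price_in, price_out = PRICES[model_class]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def blended_cost(calls: int, mix: dict, input_tokens: int = 10_000,
                 output_tokens: int = 2_000) -> float:
    """Total cost when `mix` gives each class's share of identical requests."""
    return sum(
        calls * share * request_cost(model_class, input_tokens, output_tokens)
        for model_class, share in mix.items()
    )


for model_class in PRICES:
    print(f"{model_class:8s} {request_cost(model_class, 10_000, 2_000):.2f}")

# Assumed traffic mixes: half of the premium calls are rerouted to the mid tier.
before = blended_cost(1_000_000, {"premium": 0.2, "mid": 0.5, "small": 0.3})
after = blended_cost(1_000_000, {"premium": 0.1, "mid": 0.6, "small": 0.3})
print(f"saving from rerouting: {1 - after / before:.0%}")
```

The per-request differences look negligible in isolation; the second part of the sketch shows why they dominate the bill once the same request is multiplied by a monthly call volume.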
Between a premium model and a budget model there is therefore, depending on the provider, a price difference of more than a factor of ten. With a handful of calls this is irrelevant, but extrapolated to hundreds of thousands or millions of calls per month, the model choice decisively determines the total bill. The point is therefore not to force the cheapest model across the board, but to route each task to the cheapest model that still reliably delivers the required quality. These values change regularly in the market, which is why the assignment should be understood as an ongoing task rather than a one-time decision.

Two further levers come on top. Lean inputs, that is, shorter system instructions, capped output lengths, and a deliberately reduced context, lower consumption immediately. And caching recurring requests avoids paying again and again for the same question. Notably, the combination of model choice, lean calls, and caching alone reaches an entry-level target of around thirty percent savings in many projects. It is the same experience we know from cloud optimization projects.

A question that comes up in almost every conversation is whether to run your own models, that is, on-prem or in your own cloud. The intuitive assumption is that self-operation is cheaper. An honest look at the total cost qualifies this, however. On top of the pure compute power come efforts for operation and maintenance, for unused but paid-for capacity, for power and cooling, and for regular model updates. For most organizations, access via APIs is therefore initially the more economical option, as long as the consumption volume is not very high. The actual driver for self-operation is therefore rarely price, but compliance and data sovereignty. Where regulatory requirements, data residency, or particular confidentiality demand it, self-operation can be the only permissible option, regardless of the pure cost calculation.
In practice, a hybrid approach often makes sense, in which predictable and sensitive loads are processed locally and the rest is covered via APIs. What matters is that this decision is made on the basis of reliable consumption data and not from gut feeling. Here too, transparency is the prerequisite for a sound decision.

From our FinOps experience we know that the most sustainable effect does not come from the dashboard, but from the people who cause the costs. A team that understands that a forgotten environment or an oversized resource costs real money makes a thousand small decisions better. With AI this applies in an intensified form, because here it is not a one-time infrastructure decision that tips the scales, but every developer making a cost decision with every call: which model, how much context, whether an agent is really needed.

This is precisely why the guiding principle of every training is that token consumption is a measure of activity, not of productivity. More consumption is not a sign of progress, but a cost factor that must justify itself. How important this distinction is becomes clear from a well-known counterexample: a large company had internally introduced a leaderboard that made developers’ token consumption visible. The result was that employees began performing unnecessary tasks in order to climb the list. The leaderboard was abolished again. It was not a tooling problem but an enablement problem, because the wrong metric had been elevated to a virtue.

An effective program therefore works on several levels. Developers learn practical cost hygiene, that is, when a small model suffices, why long context costs anew with every call, what an agentic process actually consumes, and how to read their own costs in the first place. Team leads learn to understand their budget, to read analyses, and to recognize anomalies in their own area, whereby cost awareness is established as a shared responsibility and not as a punishment.
And decision-makers learn to set sensible guardrails without stifling innovation, and to incentivize correctly.

From this follows a clear sequence that holds the whole approach together: visibility leads to problem awareness, and problem awareness leads to changed behavior. Without visibility there is no awareness, and without awareness no lasting behavioral change. The dashboard is not the goal here, but the tool with which people begin to see cost decisions that were previously invisible.

AI cost management is not a one-time project but an ongoing practice. Consumption, models, and usage patterns change, and so the steering must continuously adapt as well. What matters is therefore ongoing monitoring that breaks down costs by user, team, and feature, detects anomalies, and makes the effectiveness of the introduced measures visible. In practice, specialized tools support this process.

Before we look at the concrete tool, it is worth looking at the target picture. Conceptually, an AI cost dashboard should answer at a glance who consumes how much, where unallocated costs lie, where an anomaly occurs, and how the model mix is distributed.

Figure: Schematic target picture of an AI cost dashboard, showing cost by team with an unallocated bucket, an anomaly indicator, and the model mix as the routing lever

Just as in classic monitoring of logs, metrics, and traces, AI telemetry can also be bundled and correlated in a central place. A platform like Dash0 takes in the standardized telemetry data and makes it possible to break down costs by user, team, or feature, to track them over time, and to attribute anomalies early to the responsible applications. Because the instrumentation is based on an open standard, the analysis system itself remains interchangeable, which preserves independence from individual providers.
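Independent of the concrete tool, the core of such an analysis is a group-by over the attribution attributes plus a simple anomaly rule. The following sketch assumes a hypothetical record shape with a precomputed `cost_eur` field and a deliberately naive threshold; it is not how Dash0 or any other platform implements this.

```python
from collections import defaultdict
from statistics import mean


def cost_by(records: list, attribute: str) -> dict:
    """Sum cost per attribute value; records without it go to 'unallocated'."""
    totals = defaultdict(float)
    for record in records:
        totals[record.get(attribute) or "unallocated"] += record["cost_eur"]
    return dict(totals)


def is_anomaly(history: list, today: float, factor: float = 3.0) -> bool:
    """Flag today's spend if it exceeds `factor` times the historical average."""
    if not history:
        return False
    return today > factor * mean(history)


# Hypothetical telemetry records: metadata only, no prompt content.
records = [
    {"team": "customer-support", "feature": "support-rag", "cost_eur": 12.5},
    {"team": "customer-support", "feature": "ticket-summary", "cost_eur": 4.0},
    {"team": "platform", "feature": "code-review", "cost_eur": 7.5},
    {"feature": "support-rag", "cost_eur": 2.0},  # no team tag
]
print(cost_by(records, "team"))
print(cost_by(records, "feature"))
print(is_anomaly([40.0, 42.0, 38.0], today=260.0))
```

The same function serves every breakdown, by team, user, or feature, which is the practical payoff of setting all attribution attributes at the time of the call.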
The schematic target picture thus becomes a real analysis:

Figure: Cost attribution in Dash0, with a breakdown by team, user, and feature, an unallocated bucket as a coverage metric, and an example of a detected runaway agent

In this example, several of the principles described above become visible at a glance. Costs can be attributed to the causing areas via the team attribute, while an unallocated bucket shows how high the share without a set tag is and thus serves as a coverage metric. The breakdown by feature reveals where the highest cost per call arises, and the model mix makes the routing lever tangible by showing which share of the cost falls on the largest models. Finally, the example of a detected runaway agent illustrates how a soft warning threshold and a hard budget limit work together to stop a derailing process within minutes.

Managing AI costs is not a fundamentally new field, but the consistent extension of what FinOps has been delivering in the cloud world for years. The unit of cost has shifted from resource consumption to the token, the speed at which costs arise has increased, and the number of people making cost decisions every day has grown considerably. The effective levers, however, remain the same: transparency through consistent attribution, clear guardrails through limits, continuous optimization through the right model choice and lean calls, and, as the decisive factor, empowered teams that understand their costs.

If you no longer want to treat AI costs as an unavoidable side effect of innovation, but as a controllable variable in your FinOps practice, it is worth looking at the right architecture, sensible limits, and a well-thought-out enablement program. Get in touch with us for a no-obligation conversation. Together we will examine how your specific situation can be analyzed and which measures will contribute the most in your organization.

Note: This post is intended for orientation and reflects our assessment at the time of publication.
Prices, model names, and technical details in the AI space change quickly, and despite careful research, individual statements may contain errors or may have become outdated. The example data, architectures, and dashboards shown are illustrative and do not replace an analysis tailored to your situation. If you would like to assess your specific AI cost situation, we are happy to act as a sparring partner with you on architecture, limits, and enablement. Contact us for a no-obligation conversation.