TL;DR. If your LLM bill is one line item on a cloud invoice, you cannot answer "which team spent that." We fixed this by tagging every gateway span with `team.id`, `project.id`, and `feature.id`, plus the OpenInference token-count attributes, shipping those spans through an OTel collector into Tempo, and rolling cost up per team with TraceQL in Grafana. The payoff that sold it internally: one team's monthly spend quietly went from a few hundred dollars to over a thousand because of a retry loop, and the org-level dashboard never flinched. The per-team view caught it in a day. Below is the wiring, the collector config, the rollup query, the alert, and the attributes I tried and threw away.
Most teams already collect LLM telemetry. Spans exist, tokens get counted, traces land somewhere. What is missing is the dimension that finance and eng leads actually ask about: who owns this spend. The provider invoice gives you one number per month per API key. If you share keys across services (most people do at some point), that number is useless for chargeback. You cannot tell the platform team's spend from the support-bot team's spend.
So the design goal was narrow. Every LLM call has to carry enough labels that I can group spend by team, by project under that team, and by feature inside that project. Three levels. No more, because once you go deeper than feature, nobody reads the dashboard. I standardized the whole pipeline on OpenTelemetry and OpenInference, and I will state the one opinion plainly: I want the labels, the wire format, and the storage to be things I can swap without rewriting instrumentation. We tag spans with open semantic conventions so the day we change a backend or a dashboard tool, the gateway code does not move. That is a portability decision, not a verdict on anyone's product.
Tag at the gateway, not in each service. We run an LLM gateway (every call to every provider goes through it), so it is the one place that sees model, token counts, and request context together. A new service gets attribution for free as long as it routes through the gateway and forwards the three context headers.
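To make the header forwarding concrete, here is a minimal sketch of how a gateway could turn the three context headers into span attributes. The header names (`x-team-id`, `x-project-id`, `x-feature-id`) and the function name are assumptions for illustration; the real header names are whatever your gateway contract defines. Missing headers are deliberately left unset here so a later pipeline stage can label them.

```python
from typing import Mapping

# Hypothetical header names: the gateway contract defines the real ones.
ATTRIBUTION_HEADERS = {
    "x-team-id": "team.id",
    "x-project-id": "project.id",
    "x-feature-id": "feature.id",
}


def attribution_attributes(headers: Mapping[str, str]) -> dict[str, str]:
    """Map incoming request headers to span attribute key/value pairs.

    Header lookup is case-insensitive. Missing or blank headers are
    skipped, not defaulted, so unlabeled calls stay detectable downstream.
    """
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    attrs: dict[str, str] = {}
    for header, attribute in ATTRIBUTION_HEADERS.items():
        value = lowered.get(header, "")
        if value:
            attrs[attribute] = value
    return attrs


# A request that forwarded two of the three headers.
example = attribution_attributes(
    {"X-Team-Id": "support-platform", "X-Project-Id": "helpdesk-bot"}
)
print(example)
```

With the OpenTelemetry Python SDK, the returned dict can be applied to the active LLM span with `span.set_attributes(attrs)`.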
The cost-math group comes straight from OpenInference semantic conventions: `llm.model_name`, `llm.token_count.prompt`, `llm.token_count.completion`. The attribution group is custom, set from request headers: `team.id`, `project.id`, `feature.id`. Cost is not a span attribute. I compute it at query time from token counts and a small price lookup, because prices change and I do not want last quarter's spans frozen at last quarter's rates.
| Attribute | What it buys you | Keep or drop |
|---|---|---|
| `team.id` | Top-level chargeback. The number a director asks for. | Keep |
| `project.id` | Splits a team's spend across its services. | Keep |
| `feature.id` | Which feature drove a spike inside a project. | Keep |
| `llm.model_name` | Lets you weight tokens by per-model price. | Keep |
| `llm.token_count.prompt` | Input side of the cost. | Keep |
| `llm.token_count.completion` | Output side. Usually the expensive half and the one that runs away. | Keep |
| `user.id` | Per-user spend, in theory. A privacy liability in traces. | Drop |
| `request.id` | Already covered by the trace and span IDs. | Drop |
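The Keep/Drop column can be enforced in code rather than by convention. A minimal sketch, assuming the gateway assembles span attributes as a plain dict before setting them on the span; `DROPPED_ATTRIBUTES` and `strip_dropped` are hypothetical names, not part of any library.

```python
from typing import Any, Mapping

# Attributes marked "Drop" in the table above.
DROPPED_ATTRIBUTES = frozenset({"user.id", "request.id"})


def strip_dropped(attributes: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy of the attributes without the dropped keys.

    Everything else passes through untouched, so unrelated span
    attributes are not affected.
    """
    return {
        key: value
        for key, value in attributes.items()
        if key not in DROPPED_ATTRIBUTES
    }


cleaned = strip_dropped(
    {
        "team.id": "support-platform",
        "llm.model_name": "model-mini",  # placeholder model name
        "llm.token_count.completion": 412,
        "user.id": "u-123",
        "request.id": "r-456",
    }
)
print(sorted(cleaned))
```

A denylist is used here instead of an allowlist so that attributes added by other instrumentation survive; an allowlist is the stricter choice if you want nothing unexpected in traces.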
OTLP in, set anything the gateway missed, batch, Tempo out. The one processor worth calling out is `transform`: I use it to backfill `team.id` with a sentinel when a service forgets the header, so unlabeled spend shows up as `unattributed` instead of vanishing. Cost with no label is cost you will never find.
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
processors:
  batch: { timeout: 5s, send_batch_size: 1024 }
  transform/attribution:
    trace_statements:
      - context: span
        statements:
          - set(attributes["team.id"], "unattributed") where attributes["team.id"] == nil
          - set(attributes["project.id"], "unknown") where attributes["project.id"] == nil
          - set(attributes["feature.id"], "unknown") where attributes["feature.id"] == nil
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform/attribution, batch]
      exporters: [otlp/tempo]
Two notes from running this. Put `transform` before `batch` so the backfill happens per span while the data is still cheap to touch. And keep the price table out of the collector. I tried encoding per-model rates as collector attributes once. Every price change became a config deploy, and the rates drifted out of sync with what we were actually billed. Pricing lives next to the query now.
Tempo stores spans, not dollars. So the rollup is two steps: TraceQL pulls token sums grouped by the attribution attributes, and a small price map turns tokens into cost downstream. I start from this, which aggregates output-token counts (the number I watch most, because completion tokens are usually where the money and the runaways are):
{ .team.id = "support-platform" && .llm.token_count.completion > 0 }
| sum_over_time(.llm.token_count.completion) by (.team.id, .project.id, .llm.model_name)
Drop the `team.id` filter and group by it instead for the all-teams board. The grouping by `llm.model_name` matters: a mini-tier model and a frontier model can differ by more than an order of magnitude per token, so summing raw tokens across models hides which team is expensive because of volume versus model choice. The dollar step is deliberately dumb: a lookup from `llm.model_name` to input-price and output-price per thousand tokens, multiplied through, summed per team. Keeping it dumb and external is what lets me re-price history when a provider changes rates.
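The dollar step can be sketched in a few lines. This is an illustration under stated assumptions, not the production code: the model names and per-thousand-token rates in `PRICES` are placeholders (not real provider prices), and `TokenRow` stands in for one row of the grouped token sums. `Decimal` is used so the money math is not subject to float rounding.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Price:
    input_per_1k: Decimal   # price per 1,000 prompt tokens
    output_per_1k: Decimal  # price per 1,000 completion tokens


@dataclass(frozen=True)
class TokenRow:
    team: str
    model: str
    prompt_tokens: int
    completion_tokens: int


# Placeholder models and rates, for illustration only.
PRICES = {
    "model-mini": Price(Decimal("0.001"), Decimal("0.002")),
    "model-frontier": Price(Decimal("0.02"), Decimal("0.06")),
}


def cost_per_team(rows, prices):
    """Weight each row's tokens by its model's price and sum per team."""
    totals = defaultdict(Decimal)
    for row in rows:
        price = prices[row.model]  # unknown model raises KeyError on purpose
        totals[row.team] += (
            Decimal(row.prompt_tokens) / 1000 * price.input_per_1k
            + Decimal(row.completion_tokens) / 1000 * price.output_per_1k
        )
    return dict(totals)


rows = [
    TokenRow("support-platform", "model-mini", 200_000, 100_000),
    TokenRow("support-platform", "model-frontier", 10_000, 5_000),
    TokenRow("search", "model-mini", 50_000, 20_000),
]
costs = cost_per_team(rows, PRICES)
print(costs)

# Re-pricing history is just the same rows with a different price map.
new_prices = {**PRICES, "model-mini": Price(Decimal("0.001"), Decimal("0.004"))}
repriced = cost_per_team(rows, new_prices)
```

Because the rows hold only tokens, a rate change touches the price map and nothing else.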
Cost attribution is reporting. The thing that earns its keep is the page. The rule I run is week-over-week on output tokens per team: if this week's completion-token total for any team is more than 2x the same window last week, page. Output tokens, not input, because the runaway failure modes (retry storms, an agent that loops, a prompt-chaining bug that re-asks) all show up as generation volume first.
Why 2x week-over-week and not a fixed dollar ceiling? A fixed ceiling either pages constantly for your big teams or never fires for your small ones. A relative jump normalizes across team size on its own.
The team whose spend doubled in the story above would have tripped a 2x rule on day one. It did not trip our dollar alert because the absolute number was still small against the org total. Small against the org, doubled for the team, is exactly the blind spot per-team attribution exists to close. Route it to whoever owns the team's budget, not a shared channel where it gets ignored.
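The rule itself is small enough to sketch. This assumes you already have per-team completion-token totals for the two windows; the function name, the team names, and the token numbers are made up for illustration. One assumption worth flagging: a team with no usage in the previous window is skipped here, since a ratio against zero is undefined, and you may prefer to page on brand-new spend instead.

```python
def teams_to_page(this_week, last_week, ratio=2.0):
    """Return teams whose completion tokens grew by more than `ratio`.

    Both arguments map team id -> completion-token total for a window.
    Teams with no baseline in `last_week` are skipped (an assumption).
    """
    paged = []
    for team, current in sorted(this_week.items()):
        previous = last_week.get(team, 0)
        if previous > 0 and current > ratio * previous:
            paged.append(team)
    return paged


last_week = {"support-platform": 1_000_000, "search": 38_000_000, "docs": 80_000}
this_week = {"support-platform": 2_600_000, "search": 40_000_000, "docs": 90_000}

print(teams_to_page(this_week, last_week))
```

In this made-up data the small team more than doubles while the large team adds more tokens in absolute terms but grows only about 5%, so only the small team is paged.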
**`user.id`.** Per-user spend sounds useful and is occasionally asked for. But putting a user identifier on every span means every trace is now PII, and your whole tracing backend inherits the retention, access, and deletion obligations that come with that. The attribution win did not come close to justifying the compliance surface. Dropped it, have not missed it.
**`request.id`.** Pure redundancy. A trace already has a trace ID and every span has a span ID. Anywhere I thought I wanted it, the trace ID was already there and already correct.
The pattern in both: an attribute is only worth tagging if it answers a question the cheaper attributes cannot, and if its cost (privacy, plumbing, drift) is lower than that answer is worth.
**Why compute cost at query time instead of writing a cost attribute on the span?** Prices change and I want to re-cost history when they do. A cost attribute freezes the rate at write time.
**Do I need the gateway, or can each service tag its own spans?** You can tag per service. I prefer the gateway because it sees model and tokens and request context in one place, so a new service gets attribution by routing through it and forwarding three headers.
**Why Tempo specifically?** It is what we run, and TraceQL's aggregation over span attributes does the rollup I need. The attribute conventions are OpenInference, so the labels are not tied to Tempo. The point of standardizing on open conventions is that this choice is reversible.
**What if a service forgets the attribution headers?** The collector backfills `unattributed`. The spend still shows up, just in a bucket whose name tells me to go fix the instrumentation.
**Is week-over-week 2x too noisy?** For steady traffic, no. For genuinely spiky workloads, raise the ratio or widen the comparison window. I bias toward a slightly noisy page over a silent doubling.
Caching breaks the token-to-cost math. Cached prompt tokens bill at a different rate (sometimes free), and I do not yet tag cache hits cleanly enough to price them right. Streaming and cancelled generations are another gap: if a client disconnects mid-stream, what is the honest output-token count, and does the provider bill for tokens generated after the cancel? Feature-level granularity has a ceiling: I keep wanting per-prompt-version attribution, but every level deeper is one more label nobody reads. And I have not decided whether the 2x week-over-week ratio should itself be per-team, since some teams are spiky by nature and one global ratio serves spiky and steady teams alike imperfectly. If you have wired cached-token pricing into a span-based cost model in a way that survives a provider changing its cache rates, I want to hear how.