Trace Sampling for LLM Apps: Keep the Spans That Matter, Drop the Rest

An engineer details how to implement tail sampling for LLM application traces to control observability costs while preserving critical data. The approach uses OpenTelemetry's tail sampling processor to keep every trace with errors, high latency, high cost, or specific evaluation tags, and to keep only a small probabilistic sample of everything else.

You ship an LLM feature. Traffic grows. The observability bill grows faster. Every chat turn is a trace. Every trace carries the full prompt, the full response, the retrieved chunks, the tool calls, the token counts. A single agent run can be twenty spans, each one fat with text. At a thousand requests a second, you are writing terabytes a day to a backend that charges per gigabyte ingested and per gigabyte stored. The first instinct is to store everything, because you never know which trace you will need. That instinct is correct right up until the invoice arrives.

So you sample. The question is how. Sample wrong and you drop the one trace the customer is complaining about. Sample right and you keep the spans that matter and drop the noise. This is about getting the policy right.

There are two places you can make the keep-or-drop decision.

Head sampling decides at the start of the trace, before you know how it ends. The root span flips a weighted coin: keep 10% of traffic, drop the other 90%. The OpenTelemetry SDK can do this with a built-in sampler, no extra infrastructure. It is cheap because the dropped traces are never serialized or sent anywhere. The catch: the decision is blind. A request that ends in a 500, a timeout, or a $4 token bill has the same 10% chance of survival as a boring cache hit. You will drop most of your errors. For an LLM app, where the interesting failures hide inside HTTP 200s, that is the wrong trade.

Tail sampling decides at the end, once the trace has had time to finish and you can see what happened. It runs in a collector, not the SDK, because something has to buffer all the spans of a trace for a decision window and then apply rules to the whole thing. Now you can say: keep every trace that errored, keep every trace slower than 10 seconds, keep every trace that cost more than a dollar, and keep 5% of the rest. You pay to buffer everything briefly, but you only store what you decide to keep.

For LLM applications, tail sampling is where the value is.

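
To make the contrast between the two approaches concrete, here is a small simulation. It is not OpenTelemetry code: the trace shape, the 1% error rate, and the function names are all assumptions chosen for illustration, and a real tail sampler evaluates more than an error flag.

```python
import random


def make_traces(n, error_rate, rng):
    """Build synthetic traces; each is a dict with an 'error' flag (hypothetical shape)."""
    return [{"id": i, "error": rng.random() < error_rate} for i in range(n)]


def head_sample(traces, rate, rng):
    """Decide at trace start: a weighted coin flip that cannot see the outcome."""
    return [t for t in traces if rng.random() < rate]


def tail_sample(traces, rate, rng):
    """Decide at trace end: keep every error, then a random share of the rest."""
    return [t for t in traces if t["error"] or rng.random() < rate]


def errors_kept(kept):
    """Count how many of the kept traces carry an error."""
    return sum(1 for t in kept if t["error"])


rng = random.Random(7)  # fixed seed so the run is repeatable
traces = make_traces(100_000, error_rate=0.01, rng=rng)
total_errors = errors_kept(traces)

head = head_sample(traces, 0.10, rng)  # keep 10% of everything, blind
tail = tail_sample(traces, 0.05, rng)  # keep all errors plus 5% of the rest

print(f"errors in traffic: {total_errors}")
print(f"head sampling kept {errors_kept(head)} errors in {len(head)} traces")
print(f"tail sampling kept {errors_kept(tail)} errors in {len(tail)} traces")
```

Under these assumptions the head sampler retains roughly a tenth of the errors, while the tail sampler retains all of them and still stores fewer traces overall.
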
The signal you care about is exactly what head sampling throws away at random: the slow call, the expensive call, the failed tool sequence.

Before any config, write the policy down in words. Mine, for a typical LLM product:

1. Keep every trace that ended in an error.
2. Keep every trace slower than 10 seconds.
3. Keep every trace that cost more than $1.
4. Keep every trace tagged as eval traffic (canary, regression, review).
5. Keep 5% of everything else.

The order matters. The rules are evaluated as a chain of "if any of these match, keep." The probabilistic sample is the last clause, the fallback for everything that did not match a keep rule.

The Collector ships a tail sampling processor (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md). It buffers spans by trace ID, waits a configurable decision window that starts when the first span of a trace arrives, then runs your policies. Here is a config that encodes the policy above.

```
processors:
  tail_sampling:
    # Counted from a trace's first span; kept above the 10s keep-slow threshold.
    decision_wait: 30s
    # Number of traces held in memory while waiting for a decision.
    num_traces: 100000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 10000
      - name: keep-expensive
        type: numeric_attribute
        numeric_attribute:
          key: gen_ai.usage.cost_usd
          min_value: 1
      - name: keep-eval-traffic
        type: string_attribute
        string_attribute:
          key: eval.tag
          values: [canary, regression, review]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

A few things to know about this processor. `decision_wait` is how long the Collector holds a trace's spans, counted from the first span it receives for that trace, before it decides; set it longer than your slowest expected trace or you will make decisions on incomplete data. That is why it is 30s here: a window no longer than the 10-second `keep-slow` threshold would close before the slow traces have finished. `num_traces` is the in-memory buffer size, and it is a memory cost you have to budget for. The policies are an OR: a trace is kept if it matches any policy, so the cheap probabilistic rule never overrides a keep.

The cost attribute `gen_ai.usage.cost_usd` is not standard. You set it yourself at instrumentation time, computed from the token counts the provider returns. Check the README for your Collector version on one point: the `numeric_attribute` policy has historically matched integer attribute values, so a fractional dollar cost may need to be recorded in integer units (cents, say) for `min_value` to behave as intended. The point is that tail sampling can route on any attribute you put on the span, so put the ones you want to filter on there.

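
The "keep if any policy matches, otherwise sample" logic can be sketched in a few lines of plain Python. This is only a model of the semantics, not how the processor is implemented: the `TraceSummary` shape, the rule table, and `decide` are hypothetical names, and the real processor bases its probabilistic decision on the trace ID rather than a random draw.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceSummary:
    """What a tail sampler can see once a trace is complete (hypothetical shape)."""
    has_error: bool = False
    duration_ms: float = 0.0
    cost_usd: float = 0.0
    eval_tag: Optional[str] = None


EVAL_TAGS = {"canary", "regression", "review"}

# Each keep rule is a (name, predicate) pair; names mirror the policy names in the config.
KEEP_RULES = [
    ("keep-errors", lambda t: t.has_error),
    ("keep-slow", lambda t: t.duration_ms > 10_000),
    ("keep-expensive", lambda t: t.cost_usd >= 1.0),
    ("keep-eval-traffic", lambda t: t.eval_tag in EVAL_TAGS),
]


def decide(trace, rng, fallback_rate=0.05):
    """Return (keep, reason). Any matching keep rule wins; otherwise sample."""
    for name, matches in KEEP_RULES:
        if matches(trace):
            return True, name
    if rng.random() < fallback_rate:
        return True, "sample-the-rest"
    return False, "dropped"


rng = random.Random(0)
print(decide(TraceSummary(has_error=True), rng))
print(decide(TraceSummary(duration_ms=12_500), rng))
print(decide(TraceSummary(cost_usd=1.80), rng))
print(decide(TraceSummary(eval_tag="canary"), rng))
```

Because the fallback only runs when no keep rule matched, lowering `fallback_rate` changes how much healthy traffic is stored without ever affecting the traces the rules pin.
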
Tail rules are only as good as the span attributes you feed them. At instrumentation time, stamp the trace with what the Collector will need to decide.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

# illustrative rates; set from your provider's pricing
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


def record_llm_call(model, prompt, response, usage, tag=None):
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("gen_ai.request.model", model)
        in_tok = usage["input_tokens"]
        out_tok = usage["output_tokens"]
        # Cost in USD, computed from the token counts the provider returned.
        cost = (
            (in_tok / 1000) * PRICE_PER_1K["input"]
            + (out_tok / 1000) * PRICE_PER_1K["output"]
        )
        span.set_attribute("gen_ai.usage.cost_usd", cost)
        span.set_attribute("gen_ai.usage.input_tokens", in_tok)
        span.set_attribute("gen_ai.usage.output_tokens", out_tok)
        if tag:
            # Marks eval traffic so the keep-eval-traffic policy can pin it.
            span.set_attribute("eval.tag", tag)
        return response
```

When you run your hourly canary or a regression suite, set `tag="canary"` on those calls. The `keep-eval-traffic` policy then pins them at 100%, so your offline comparisons always have the full trace, never a sampled gap. A canary you only kept 5% of the time is a canary you cannot trust.

The probabilistic rule undercounts your real volume. Once you sample the healthy traffic at 5%, any metric you compute from stored traces (request count, average cost, token throughput) is wrong unless you correct for it: healthy requests are undercounted twentyfold, while the errors and outliers you kept at 100% are counted in full, so averages skew toward the bad cases. The fix is to derive volume and cost metrics from a separate, unsampled metrics pipeline, and treat traces as exemplars, not as the source of truth for counts. Sample your traces; never sample your counters.

Tail sampling does not compose with load balancing for free. The tail sampling processor needs every span of a trace to land in the same Collector instance, because it decides per trace ID. If your spans fan out across a pool of Collectors behind a round-robin load balancer, a trace gets split and the decision is made on a fragment.

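
A small simulation makes the splitting problem concrete. It is a sketch under stated assumptions: the collector names, the span list, and both routing functions are hypothetical, and the plain hash-modulo routing here is a simplification that only demonstrates the property tail sampling needs, namely that every span of a trace reaches the same instance.

```python
import hashlib
from collections import defaultdict
from itertools import cycle

COLLECTORS = ["collector-0", "collector-1", "collector-2"]  # hypothetical pool


def route_round_robin(spans):
    """Assign spans to collectors in arrival order, ignoring the trace ID."""
    seen = defaultdict(set)
    for (trace_id, _span_name), collector in zip(spans, cycle(COLLECTORS)):
        seen[trace_id].add(collector)
    return seen


def route_by_trace_id(spans):
    """Pick the collector from a stable hash of the trace ID."""
    seen = defaultdict(set)
    for trace_id, _span_name in spans:
        digest = hashlib.sha256(trace_id.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(COLLECTORS)
        seen[trace_id].add(COLLECTORS[index])
    return seen


# Two agent runs whose spans arrive interleaved.
spans = [
    ("trace-a", "agent.run"), ("trace-b", "agent.run"),
    ("trace-a", "retrieval"), ("trace-b", "llm.chat"),
    ("trace-a", "llm.chat"), ("trace-b", "tool.call"),
    ("trace-a", "tool.call"),
]

rr = route_round_robin(spans)
hashed = route_by_trace_id(spans)

# Each mapping shows which collectors saw at least one span of each trace.
print({t: sorted(c) for t, c in rr.items()})
print({t: sorted(c) for t, c in hashed.items()})
```

With round-robin assignment each trace ends up spread over several collectors, so no single instance sees enough to judge it. Routing on the trace ID sends every span of a trace to exactly one instance.
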
The standard fix is a two-tier setup: a first tier that routes by trace ID (the loadbalancing exporter, https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/README.md, does this), feeding a second tier that does the actual tail sampling. If you skip this and your traces look mysteriously truncated, this is why.

You do not need the full two-tier collector on day one. Start with head sampling in the SDK at a flat rate, get traces flowing, learn what your traffic looks like. The day the bill stings, move to tail sampling and encode the keep rules. The policy above is a starting point; the thresholds are yours to set from your own latency budget and cost ceiling.

The thing to hold onto is the principle. Errors, the slow tail, the expensive calls, and the eval set are not sampleable. Everything else is. Get that boundary right and you keep the spans that explain your incidents while dropping the ones that only explained that Tuesday was normal.

Sampling is one decision in a longer chain (instrumentation, evals, alerting, cost accounting) that decides whether your LLM observability actually answers the questions you have at 2am. The book works through the whole stack, and the sampling and cost-tracking chapters go deeper on the trade-offs sketched here.