# Trace Sampling for LLM Apps: Keep the Spans That Matter, Drop the Rest

> Source: <https://dev.to/gabrielanhaia/trace-sampling-for-llm-apps-keep-the-spans-that-matter-drop-the-rest-3ejj>
> Published: 2026-06-13 11:00:10+00:00

You ship an LLM feature. Traffic grows. The observability bill grows faster.

Every chat turn is a trace. Every trace carries the full prompt, the full response, the retrieved chunks, the tool calls, the token counts. A single agent run can be twenty spans, each one fat with text. At a thousand requests a second, you are writing terabytes a day to a backend that charges per gigabyte ingested and per gigabyte stored. The first instinct is to store everything, because you never know which trace you will need. That instinct is correct right up until the invoice arrives.

So you sample. The question is how. Sample wrong and you drop the one trace the customer is complaining about. Sample right and you keep the spans that matter and drop the noise. This is about getting the policy right.

There are two places you can make the keep-or-drop decision.

**Head sampling** decides at the start of the trace, before you know how it ends. The root span flips a weighted coin: keep 10% of traffic, drop the other 90%. The OpenTelemetry SDK can do this with a built-in sampler, no extra infrastructure. It is cheap because the dropped traces are never serialized or sent anywhere.

The catch: the decision is blind. A request that ends in a 500, a timeout, or a $4 token bill has the same 10% chance of survival as a boring cache hit. You will drop most of your errors. For an LLM app, where the interesting failures hide inside HTTP 200s, that is the wrong trade.
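
To make the blindness concrete, here is a minimal, stdlib-only sketch of a head-sampling coin flip keyed on the trace ID. It is a simplified illustration of the idea, not the OpenTelemetry SDK's implementation; the helper name `head_keep` and the simulated traffic (1% errors, fixed random seed) are assumptions made for the example.

``` python
import random

MASK_64 = (1 << 64) - 1

def head_keep(trace_id: int, rate: float) -> bool:
    """Decide at trace start, using only the trace ID (hypothetical helper)."""
    # Compare the low 64 bits of the ID against a threshold proportional to rate.
    return (trace_id & MASK_64) < int(rate * (1 << 64))

def simulate(n_traces: int = 10_000, error_rate: float = 0.01, rate: float = 0.10):
    """Return (traces kept, error traces seen, error traces kept)."""
    rng = random.Random(0)  # fixed seed so the run is repeatable
    kept = errors = errors_kept = 0
    for _ in range(n_traces):
        trace_id = rng.getrandbits(128)
        is_error = rng.random() < error_rate  # the outcome, unknown at decision time
        keep = head_keep(trace_id, rate)      # the decision never looks at is_error
        kept += keep
        errors += is_error
        errors_kept += keep and is_error
    return kept, errors, errors_kept

kept, errors, errors_kept = simulate()
print(f"kept {kept} traces; {errors_kept} of {errors} error traces survived")
```

Because `head_keep` never sees the outcome, error traces survive at roughly the same rate as everything else, which is the trade described above.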

**Tail sampling** decides at the end, once the trace has had time to finish and you can see what happened. It runs in a collector, not the SDK, because the collector buffers the spans of a trace for a decision window, then applies rules. Now you can say: keep every trace that errored, keep every trace slower than 10 seconds, keep every trace that cost more than a dollar, and keep 5% of the rest. You pay to buffer everything briefly, but you only store what you decide to keep.

For LLM applications, tail sampling is where the value is. The signal you care about is exactly what head sampling throws away at random: the slow call, the expensive call, the failed tool sequence.

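Before any config, write the policy down in words. Mine, for a typical LLM product:

1. Keep every trace that errored.
2. Keep every trace slower than 10 seconds.
3. Keep every trace that cost more than a dollar.
4. Keep every trace tagged as eval traffic (canary, regression, review).
5. Keep 5% of everything else.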

The order matters. The rules are evaluated as a chain of "if any of these match, keep." The probabilistic sample is the last clause, the fallback for everything that did not match a keep rule.
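
The rule chain can be sketched as a plain predicate over a finished trace. This is a minimal illustration under stated assumptions, not the Collector's code: the span dict layout, the helper name `keep_trace`, and the trace-ID-based fallback are all hypothetical, and the thresholds are the ones used in this article.

``` python
MASK_64 = (1 << 64) - 1
EVAL_TAGS = {"canary", "regression", "review"}

def keep_trace(trace_id: str, spans: list[dict], fallback_rate: float = 0.05) -> bool:
    """Return True if a finished, non-empty trace should be stored.

    Assumed span layout: 'start' and 'end' in seconds, 'error' as a bool,
    and 'attrs' as a dict of span attributes.
    """
    # Keep rule: any span ended in an error.
    if any(s["error"] for s in spans):
        return True
    # Keep rule: end-to-end duration above 10 seconds.
    duration = max(s["end"] for s in spans) - min(s["start"] for s in spans)
    if duration > 10.0:
        return True
    # Keep rule: any span reports a cost above one dollar.
    if any(s["attrs"].get("gen_ai.usage.cost_usd", 0.0) > 1.0 for s in spans):
        return True
    # Keep rule: the trace is tagged as eval traffic.
    if any(s["attrs"].get("eval.tag") in EVAL_TAGS for s in spans):
        return True
    # Fallback: a deterministic 5% sample keyed on the hex trace ID.
    return (int(trace_id, 16) & MASK_64) < int(fallback_rate * (1 << 64))

slow = [{"start": 0.0, "end": 12.0, "error": False, "attrs": {}}]
print(keep_trace("f" * 32, slow))  # kept by the latency rule, not the fallback
```

Keying the fallback on the trace ID rather than a fresh random number means the same trace gets the same answer every time it is evaluated.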

The OpenTelemetry Collector ships a [tail_sampling processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md). It buffers spans by trace ID, waits a configurable decision window that starts when the first span of the trace arrives, then runs your policies. Here is a config that encodes the policy above.

``` yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 10000

      - name: keep-expensive
        type: numeric_attribute
        numeric_attribute:
          key: gen_ai.usage.cost_usd
          min_value: 1

      - name: keep-eval-traffic
        type: string_attribute
        string_attribute:
          key: eval.tag
          values: [canary, regression, review]

      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

A few things to know about this processor. `decision_wait` is how long the Collector holds a trace's spans before deciding, counted from the first span it receives for that trace. Set it longer than your slowest expected trace, which for the config above means comfortably more than the 10-second latency threshold, or you will make decisions on incomplete data. `num_traces` is the maximum number of traces held in memory, and it is a memory cost you have to budget for. The policies are an OR: a trace is kept if it matches any policy, so the cheap probabilistic rule never overrides a keep.
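
A back-of-envelope estimate helps with that budget. The sketch below uses the figures from this article (a thousand requests a second, twenty spans per trace, a 10-second decision window); the 8 KiB average span size and the helper name `buffer_estimate` are assumptions, and real Collector memory use will be higher once per-trace overhead is counted.

``` python
def buffer_estimate(traces_per_sec, decision_wait_s, spans_per_trace, avg_span_bytes):
    """Rough in-flight trace count and buffered payload bytes (hypothetical helper)."""
    # Every trace that started within the decision window is still in memory.
    traces_in_flight = traces_per_sec * decision_wait_s
    buffered_bytes = traces_in_flight * spans_per_trace * avg_span_bytes
    return traces_in_flight, buffered_bytes

traces, size = buffer_estimate(1_000, 10, 20, 8 * 1024)
print(f"{traces} traces in flight, about {size / 2**30:.2f} GiB buffered")
# prints: 10000 traces in flight, about 1.53 GiB buffered
```

With these inputs, `num_traces: 100000` leaves roughly ten times headroom over the in-flight count; measure your own span sizes before trusting the byte figure.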

The cost attribute (`gen_ai.usage.cost_usd`) is not standard. You set it yourself at instrumentation time, computed from the token counts the provider returns. The point is that tail sampling can route on any attribute you put on the span, so put the ones you want to filter on there.

Tail rules are only as good as the span attributes you feed them. At instrumentation time, stamp the trace with what the Collector will need to decide.

``` python
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

# illustrative rates — set from your provider's pricing
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

def record_llm_call(model, prompt, response, usage, tag=None):
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("gen_ai.request.model", model)
        in_tok = usage["input_tokens"]
        out_tok = usage["output_tokens"]
        cost = (
            in_tok / 1000 * PRICE_PER_1K["input"]
            + out_tok / 1000 * PRICE_PER_1K["output"]
        )
        span.set_attribute("gen_ai.usage.cost_usd", cost)
        span.set_attribute("gen_ai.usage.input_tokens", in_tok)
        span.set_attribute("gen_ai.usage.output_tokens", out_tok)
        if tag:
            span.set_attribute("eval.tag", tag)
        return response
```

When you run your hourly canary or a regression suite, set `tag="canary"` on those calls. The `keep-eval-traffic` policy then pins them at 100%, so your offline comparisons always have the full trace, never a sampled gap. A canary you only kept 5% of the time is a canary you cannot trust.

**The probabilistic rule undercounts your real volume.** Once you sample the healthy traffic at 5%, any metric you compute from stored traces — request count, average cost, token throughput — is off by the sampling rate unless you correct for it. The fix is to derive volume and cost metrics from a separate, unsampled metrics pipeline, and treat traces as exemplars, not as the source of truth for counts. Sample your traces; never sample your counters.
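
A small worked example shows how large the distortion gets. It assumes a simplified world with only two classes of trace (errors kept at 100%, healthy traffic kept at 5%) and made-up volumes; the helper name `stored_view` is hypothetical. The reweighting step illustrates what correcting for the sampling rate means, and is not a substitute for the unsampled metrics pipeline.

``` python
def stored_view(total_requests, error_rate, healthy_sample_rate):
    """What the trace store sees versus what actually happened (hypothetical helper)."""
    errors_kept = round(total_requests * error_rate)   # errors are kept at 100%
    healthy_total = total_requests - errors_kept
    healthy_kept = round(healthy_total * healthy_sample_rate)
    stored = errors_kept + healthy_kept
    # Naive reading: treat stored traces as if they were all the traffic.
    naive_error_rate = errors_kept / stored
    # Correction: weight each stored trace by 1 / its keep probability.
    reweighted_total = errors_kept * 1 + healthy_kept / healthy_sample_rate
    return stored, naive_error_rate, reweighted_total

stored, naive, reweighted = stored_view(1_000_000, 0.01, 0.05)
print(f"stored={stored} naive_error_rate={naive:.1%} reweighted_total={reweighted:.0f}")
```

With a true error rate of 1%, the stored traces alone suggest an error rate near 17%, because errors are kept in full while healthy traffic is thinned twentyfold.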

**Tail sampling does not compose with load balancing for free.** The `tail_sampling` processor needs every span of a trace to land in the same Collector instance, because it decides per trace ID. If your spans fan out across a pool of Collectors behind a round-robin load balancer, a trace gets split and the decision is made on a fragment. The standard fix is a two-tier setup: a first tier that routes by trace ID (the [loadbalancing exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/README.md) does this), feeding a second tier that does the actual tail sampling. If you skip this and your traces look mysteriously truncated, this is why.
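
The difference between the two routing strategies can be sketched in a few lines. This is a toy model, not the exporter's implementation: it uses a plain hash-modulo instead of the consistent hashing a real first tier would want, and the collector names, span list, and helper `route_by_trace_id` are all made up for illustration.

``` python
import hashlib
from itertools import cycle

def route_by_trace_id(trace_id: str, collectors: list[str]) -> str:
    """Pick a collector from the trace ID alone, so a trace always lands together."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return collectors[int.from_bytes(digest[:8], "big") % len(collectors)]

collectors = ["collector-0", "collector-1", "collector-2"]
spans = [
    ("trace-a", "llm.chat"),
    ("trace-b", "llm.chat"),
    ("trace-a", "tool.call"),
    ("trace-a", "retrieval"),
]

round_robin = cycle(collectors)
rr_targets: dict[str, set[str]] = {}    # trace ID -> collectors hit under round-robin
hash_targets: dict[str, set[str]] = {}  # trace ID -> collectors hit under trace-ID routing
for trace_id, _name in spans:
    rr_targets.setdefault(trace_id, set()).add(next(round_robin))
    hash_targets.setdefault(trace_id, set()).add(route_by_trace_id(trace_id, collectors))

print("round-robin:", {t: len(c) for t, c in rr_targets.items()})
print("by trace ID:", {t: len(c) for t, c in hash_targets.items()})
```

Under round-robin the three spans of `trace-a` are spread over more than one collector, so no single instance sees the whole trace; routing on the trace ID keeps each trace on exactly one.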

You do not need the full two-tier collector on day one. Start with head sampling in the SDK at a flat rate, get traces flowing, learn what your traffic looks like. The day the bill stings, move to tail sampling and encode the keep rules. The policy above is a starting point; the thresholds are yours to set from your own latency budget and cost ceiling.

The thing to hold onto is the principle. Errors, the slow tail, the expensive calls, and the eval set are not sampleable. Everything else is. Get that boundary right and you keep the spans that explain your incidents while dropping the ones that only explained that Tuesday was normal.

Sampling is one decision in a longer chain (instrumentation, evals, alerting, cost accounting) that decides whether your LLM observability actually answers the questions you have at 2am.
