{"slug": "trace-sampling-for-llm-apps-keep-the-spans-that-matter-drop-the-rest", "title": "Trace Sampling for LLM Apps: Keep the Spans That Matter, Drop the Rest", "summary": "An engineer details how to implement tail sampling for LLM application traces to control observability costs while preserving critical data. The approach uses OpenTelemetry's tail sampling processor to keep traces with errors, high latency, high cost, or specific evaluation tags, dropping only the rest probabilistically.", "body_md": "You ship an LLM feature. Traffic grows. The observability bill grows faster.\n\nEvery chat turn is a trace. Every trace carries the full prompt, the full response, the retrieved chunks, the tool calls, the token counts. A single agent run can be twenty spans, each one fat with text. At a thousand requests a second, you are writing terabytes a day to a backend that charges per gigabyte ingested and per gigabyte stored. The first instinct is to store everything, because you never know which trace you will need. That instinct is correct right up until the invoice arrives.\n\nSo you sample. The question is how. Sample wrong and you drop the one trace the customer is complaining about. Sample right and you keep the spans that matter and drop the noise. This is about getting the policy right.\n\nThere are two places you can make the keep-or-drop decision.\n\n**Head sampling** decides at the start of the trace, before you know how it ends. The root span flips a weighted coin: keep 10% of traffic, drop the other 90%. The OpenTelemetry SDK can do this with a built-in sampler, no extra infrastructure. It is cheap because the dropped traces are never serialized or sent anywhere.\n\nThe catch: the decision is blind. A request that ends in a 500, a timeout, or a $4 token bill has the same 10% chance of survival as a boring cache hit. You will drop most of your errors. 
For an LLM app, where the interesting failures hide inside HTTP 200s, that is the wrong trade.\n\n**Tail sampling** decides at the end, once the trace has had time to complete and you can see what happened. It runs in a Collector, not the SDK, because the Collector has to buffer a trace's spans for a decision window before it applies rules. Now you can say: keep every trace that errored, keep every trace slower than 10 seconds, keep every trace that cost more than a dollar, and keep 5% of the rest. You pay to buffer everything briefly, but you only store what you decide to keep.\n\nFor LLM applications, tail sampling is where the value is. The signal you care about is exactly what head sampling throws away at random: the slow call, the expensive call, the failed tool sequence.\n\nBefore any config, write the policy down in words. Mine, for a typical LLM product:\n\n1. Keep every trace that errored.\n2. Keep every trace slower than 10 seconds.\n3. Keep every trace that cost more than $1.\n4. Keep every trace tagged as eval traffic (canary, regression, review).\n5. Keep 5% of everything else.\n\nRead the list as a chain of \"if any of these match, keep.\" The probabilistic sample is the last clause, the fallback for everything that did not match a keep rule.\n\nThe Collector ships a [tail_sampling processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md). It buffers spans by trace ID, waits a configurable decision window that starts when the first span of a trace arrives, then runs your policies. 
Here is a config that encodes the policy above.\n\n```yaml\nprocessors:\n  tail_sampling:\n    decision_wait: 30s  # must exceed the slowest trace you want to judge\n    num_traces: 100000\n    policies:\n      - name: keep-errors\n        type: status_code\n        status_code:\n          status_codes: [ERROR]\n\n      - name: keep-slow\n        type: latency\n        latency:\n          threshold_ms: 10000\n\n      - name: keep-expensive\n        type: numeric_attribute\n        numeric_attribute:\n          key: gen_ai.usage.cost_usd\n          min_value: 1\n\n      - name: keep-eval-traffic\n        type: string_attribute\n        string_attribute:\n          key: eval.tag\n          values: [canary, regression, review]\n\n      - name: sample-the-rest\n        type: probabilistic\n        probabilistic:\n          sampling_percentage: 5\n```\n\nA few things to know about this processor. `decision_wait` is how long the Collector holds a trace's spans before it decides, counted from the first span it receives for that trace; set it longer than your slowest expected trace (and longer than the `keep-slow` threshold) or you will make decisions on incomplete data. `num_traces` is the number of traces held in memory at once, and it is a memory cost you have to budget for. The policies are an OR: a trace is kept if it matches any policy, so the cheap probabilistic rule never overrides a keep.\n\nThe cost attribute (`gen_ai.usage.cost_usd`) is not standard. You set it yourself at instrumentation time, computed from the token counts the provider returns. One thing to verify against your Collector version: `min_value` is an integer bound, so if fractional dollar values do not match reliably, record cost in integer cents and adjust the threshold. The point is that tail sampling can route on any attribute you put on the span, so put the ones you want to filter on there.\n\nTail rules are only as good as the span attributes you feed them. 
At instrumentation time, stamp the trace with what the Collector will need to decide.\n\n```python\nfrom opentelemetry import trace\n\ntracer = trace.get_tracer(\"llm-app\")\n\n# Illustrative rates in USD per 1,000 tokens; set from your provider's pricing.\nPRICE_PER_1K = {\"input\": 0.003, \"output\": 0.015}\n\ndef record_llm_call(model, prompt, response, usage, tag=None):\n    with tracer.start_as_current_span(\"llm.chat\") as span:\n        span.set_attribute(\"gen_ai.request.model\", model)\n        in_tok = usage[\"input_tokens\"]\n        out_tok = usage[\"output_tokens\"]\n        # Cost is derived from the token counts the provider returns.\n        cost = (\n            in_tok / 1000 * PRICE_PER_1K[\"input\"]\n            + out_tok / 1000 * PRICE_PER_1K[\"output\"]\n        )\n        # Custom attribute read by the keep-expensive policy.\n        span.set_attribute(\"gen_ai.usage.cost_usd\", cost)\n        span.set_attribute(\"gen_ai.usage.input_tokens\", in_tok)\n        span.set_attribute(\"gen_ai.usage.output_tokens\", out_tok)\n        # Marks eval traffic so the keep-eval-traffic policy retains it.\n        if tag:\n            span.set_attribute(\"eval.tag\", tag)\n        return response\n```\n\nWhen you run your hourly canary or a regression suite, set `tag=\"canary\"` (or `tag=\"regression\"`) on those calls. The `keep-eval-traffic` policy then pins them at 100%, so your offline comparisons always have the full trace, never a sampled gap. A canary you only kept 5% of the time is a canary you cannot trust.\n\n**The probabilistic rule undercounts your real volume.** Once you sample the healthy traffic at 5%, any metric you compute from stored traces (request count, average cost, token throughput) is biased unless you correct for it: healthy requests are undercounted by a factor of 20, while errors, slow traces, and expensive traces are counted in full. The fix is to derive volume and cost metrics from a separate, unsampled metrics pipeline, and treat traces as exemplars, not as the source of truth for counts. Sample your traces; never sample your counters.\n\n**Tail sampling does not compose with load balancing for free.** The `tail_sampling` processor needs every span of a trace to land in the same Collector instance, because it decides per trace ID. 
If your spans fan out across a pool of Collectors behind a round-robin load balancer, a trace gets split and the decision is made on a fragment. The standard fix is a two-tier setup: a first tier that routes by trace ID (the [loadbalancing exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/README.md) does this), feeding a second tier that does the actual tail sampling. If you skip this and your traces look mysteriously truncated, this is why.\n\nYou do not need the full two-tier Collector setup on day one. Start with head sampling in the SDK at a flat rate, get traces flowing, learn what your traffic looks like. The day the bill stings, move to tail sampling and encode the keep rules. The policy above is a starting point; the thresholds are yours to set from your own latency budget and cost ceiling.\n\nThe thing to hold onto is the principle. Errors, the slow tail, the expensive calls, and the eval set are not sampleable. Everything else is. Get that boundary right and you keep the spans that explain your incidents while dropping the ones that only explained that Tuesday was normal.\n\nSampling is one decision in a longer chain (instrumentation, evals, alerting, cost accounting) that decides whether your LLM observability actually answers the questions you have at 2am. 
", "url": "https://wpnews.pro/news/trace-sampling-for-llm-apps-keep-the-spans-that-matter-drop-the-rest", "canonical_source": "https://dev.to/gabrielanhaia/trace-sampling-for-llm-apps-keep-the-spans-that-matter-drop-the-rest-3ejj", "published_at": "2026-06-13 11:00:10+00:00", "updated_at": "2026-06-13 11:17:37.824250+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "mlops", "developer-tools"], "entities": ["OpenTelemetry", "tail_sampling processor"], "alternates": {"html": "https://wpnews.pro/news/trace-sampling-for-llm-apps-keep-the-spans-that-matter-drop-the-rest", "markdown": "https://wpnews.pro/news/trace-sampling-for-llm-apps-keep-the-spans-that-matter-drop-the-rest.md", "text": "https://wpnews.pro/news/trace-sampling-for-llm-apps-keep-the-spans-that-matter-drop-the-rest.txt", "jsonld": "https://wpnews.pro/news/trace-sampling-for-llm-apps-keep-the-spans-that-matter-drop-the-rest.jsonld"}}