OpenTelemetry GenAI: Trace LLM Calls and Agents in Production

OpenTelemetry GenAI semantic conventions have stabilized for LLM client spans as of early 2026, enabling developers to trace individual model calls, token usage, and agent tool interactions in production. The specification provides a tiered stability map for attributes, with gen_ai.chat and gen_ai.embeddings spans considered stable for dashboards, while gen_ai.agent.* spans remain experimental and mcp.* spans are in development. This instrumentation transforms opaque agent requests into debuggable span hierarchies, revealing exactly where latency and token consumption occur.

If you are running AI agents in production without OpenTelemetry instrumentation, you are operating blind. You know the request took 6 seconds and cost $0.18 — but not which of the four model calls inside that agent loop caused the latency spike, how many tokens the reasoning step consumed versus the tool call, or whether a tool failed silently. The OpenTelemetry GenAI semantic conventions fix this. LLM client span attributes stabilized in early 2026, and you can get your first model call instrumented in about fifteen minutes. Know What Is Stable Before You Build The single biggest confusion in the community is not knowing which OTel GenAI attributes are safe to put in production dashboards. The tier map: gen ai.chat and gen ai.embeddings spans — Stable. Ship these to production dashboards today. gen ai.agent. spans — Experimental. Useful, but expect attribute renames. Use the opt-in flag. mcp. spans — Development. The spec is still being written. Do not build dashboards on these yet. The environment variable that unlocks experimental attributes without breaking existing dashboards: OTEL SEMCONV STABILITY OPT IN=gen ai latest experimental This dual-emits both legacy and new attribute names. If you have existing dashboards built on older attribute names, they keep working while you migrate. What Traditional APM Is Not Showing You Standard application performance monitoring gives you the outer HTTP call. You see POST /v1/messages 4.2s 200 OK . That is it. The official gen ai client span specification https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/ adds the attributes that actually matter: gen ai.request.model — Which model handled the call gen ai.usage.input tokens and gen ai.usage.output tokens — Exactly how many tokens burned gen ai.response.finish reasons — Why generation stopped. “max tokens” means your output got truncated — a bug, not a feature. gen ai.provider.name — Useful when routing across providers Combine these with standard trace timestamps and you shift from “something was slow” to “LLM call 2 used 890 input tokens and ran for 3.1 seconds — that is where your latency is coming from.” The Span Hierarchy That Makes Agents Debuggable The thing that actually unlocks agent debugging is the parent-child span relationship. When every LLM call and every tool call is a typed child span, your trace viewer shows you exactly where time and tokens went: agent.run total: 5.8s ├── gen ai.chat 1.2s 450 tokens │ └── gen ai.tool.call: search web 0.8s ├── gen ai.chat 3.1s 890 tokens ← your problem │ ├── gen ai.tool.call: read file 0.1s │ └── gen ai.tool.call: write file 0.2s └── gen ai.chat 0.9s 210 tokens Without instrumentation, the 5.8-second request is a black box. With it, you see LLM call 2 burned 890 input tokens and that is where you focus your optimization work. For teams not using a framework that already emits these spans, the manual instrumentation is straightforward: python from opentelemetry import trace from opentelemetry.semconv.ai import SpanAttributes tracer = trace.get tracer "myapp.ai" with tracer.start as current span "gen ai.chat" as span: span.set attribute SpanAttributes.GEN AI SYSTEM, "anthropic" span.set attribute SpanAttributes.GEN AI REQUEST MODEL, "claude-sonnet-4-6" response = client.messages.create ... span.set attribute SpanAttributes.GEN AI USAGE INPUT TOKENS, response.usage.input tokens span.set attribute SpanAttributes.GEN AI USAGE OUTPUT TOKENS, response.usage.output tokens span.set attribute "gen ai.response.finish reasons", response.stop reason Most Teams Are Already Halfway There If you are using a popular AI framework, it likely already emits OTel-compliant spans. LangChain emits native OTel spans via the langchain-opentelemetry package. CrewAI emits spans for agent tasks and tool calls. AutoGen and AG2 both have OTel instrumentation packages. For framework users, the practical path: set OTEL EXPORTER OTLP ENDPOINT to your collector URL, restart, and your framework handles span creation. No instrumentation code — just an environment variable. One Standard, Every Backend The strategic case for OTel over vendor-specific SDKs: every major observability platform now supports gen ai. attributes natively. Datadog announced native support https://www.datadoghq.com/blog/llm-otel-semantic-convention/ for OTel GenAI semantic conventions. Honeycomb, New Relic, Grafana, and Dynatrace all support them. Instrument once against the standard, route your telemetry to any backend, switch backends without touching instrumentation code. Vendor-specific LLM observability SDKs do not offer this. You instrument with their SDK, you are locked to their platform — a bet worth avoiding in a market moving this fast. What Is Coming Next The OTel GenAI SIG is actively expanding three areas. The agent span semantic conventions https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/ are growing to cover multi-agent systems — tasks, agent teams, memory operations, and artifact tracking. Stable mcp. attributes for MCP tool tracing are in progress; when those land, you will have end-to-end visibility from agent invocation through every MCP tool call. Standardized cost-tracking attributes and quality signals like time-to-first-token are also on the roadmap. The overhead argument against instrumentation does not hold: OTel adds under 1ms per call, and LLM API latency runs 100ms to 30 seconds. Per OpenTelemetry’s GenAI observability guide https://opentelemetry.io/blog/2026/genai-observability/ , the cost of not instrumenting — debugging agent failures by guesswork — is substantially higher. If you are shipping agents to production, instrument them first. Visibility before features.