{"slug": "opentelemetry-genai-trace-llm-calls-and-agents-in-production", "title": "OpenTelemetry GenAI: Trace LLM Calls and Agents in Production", "summary": "OpenTelemetry GenAI semantic conventions have stabilized for LLM client spans as of early 2026, enabling developers to trace individual model calls, token usage, and agent tool interactions in production. The specification provides a tiered stability map for attributes, with gen_ai.chat and gen_ai.embeddings spans considered stable for dashboards, while gen_ai.agent.* spans remain experimental and mcp.* spans are in development. This instrumentation transforms opaque agent requests into debuggable span hierarchies, revealing exactly where latency and token consumption occur.", "body_md": "If you are running AI agents in production without OpenTelemetry instrumentation, you are operating blind. You know the request took 6 seconds and cost $0.18 — but not which of the four model calls inside that agent loop caused the latency spike, how many tokens the reasoning step consumed versus the tool call, or whether a tool failed silently. The OpenTelemetry GenAI semantic conventions fix this. LLM client span attributes stabilized in early 2026, and you can get your first model call instrumented in about fifteen minutes.\n\n## Know What Is Stable Before You Build\n\nThe biggest source of confusion in the community is which OTel GenAI attributes are safe to put in production dashboards. The tier map:\n\n- **gen_ai.chat and gen_ai.embeddings spans**: Stable. Ship these to production dashboards today.\n- **gen_ai.agent.* spans**: Experimental. Useful, but expect attribute renames. Use the opt-in flag.\n- **mcp.* spans**: Development. The spec is still being written. 
Do not build dashboards on these yet.\n\nThe environment variable that opts in to the latest experimental attributes:\n\n```\nOTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental\n```\n\nPer the spec, this value tells an instrumentation to emit the latest experimental GenAI conventions in place of the older attribute names. Without it, instrumentations keep emitting whatever version they emitted before, so dashboards built on older attribute names keep working until you opt in and migrate them.\n\n## What Traditional APM Is Not Showing You\n\nStandard application performance monitoring gives you the outer HTTP call. You see `POST /v1/messages 4.2s 200 OK`. That is it. The [official gen_ai client span specification](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/) adds the attributes that actually matter:\n\n- `gen_ai.request.model`: which model handled the call\n- `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`: exactly how many tokens were consumed\n- `gen_ai.response.finish_reasons`: why generation stopped. “max_tokens” means your output got truncated, which is a bug, not a feature.\n- `gen_ai.provider.name`: useful when routing across providers\n\nCombine these with standard trace timestamps and you shift from “something was slow” to “LLM call 2 used 890 input tokens and ran for 3.1 seconds — that is where your latency is coming from.”\n\n## The Span Hierarchy That Makes Agents Debuggable\n\nThe thing that actually unlocks agent debugging is the parent-child span relationship. When every LLM call and every tool call is a typed child span, your trace viewer shows you exactly where time and tokens went:\n\n```\nagent.run (total: 5.8s)\n├── gen_ai.chat  1.2s  450 tokens\n│   └── gen_ai.tool.call: search_web  0.8s\n├── gen_ai.chat  3.1s  890 tokens   ← your problem\n│   ├── gen_ai.tool.call: read_file   0.1s\n│   └── gen_ai.tool.call: write_file  0.2s\n└── gen_ai.chat  0.9s  210 tokens\n```\n\nWithout instrumentation, the 5.8-second request is a black box. With it, you see that LLM call 2 burned 890 input tokens, and that is where you focus your optimization work. 
For teams not using a framework that already emits these spans, the manual instrumentation is straightforward:\n\n```python\nimport anthropic\nfrom opentelemetry import trace\n\ntracer = trace.get_tracer(\"myapp.ai\")\nclient = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment\n\nwith tracer.start_as_current_span(\"gen_ai.chat\") as span:\n    # Request attributes, written as plain string keys from the gen_ai conventions\n    span.set_attribute(\"gen_ai.provider.name\", \"anthropic\")\n    span.set_attribute(\"gen_ai.request.model\", \"claude-sonnet-4-6\")\n\n    response = client.messages.create(...)  # your usual request arguments\n\n    # Token usage and the stop reason come back on the response object\n    span.set_attribute(\"gen_ai.usage.input_tokens\",\n                       response.usage.input_tokens)\n    span.set_attribute(\"gen_ai.usage.output_tokens\",\n                       response.usage.output_tokens)\n    span.set_attribute(\"gen_ai.response.finish_reasons\",\n                       [response.stop_reason])\n```\n\n## Most Teams Are Already Halfway There\n\nIf you are using a popular AI framework, it likely already emits OTel-compliant spans. LangChain emits native OTel spans via the langchain-opentelemetry package. CrewAI emits spans for agent tasks and tool calls. AutoGen and AG2 both have OTel instrumentation packages. For framework users, the practical path: set `OTEL_EXPORTER_OTLP_ENDPOINT` to your collector URL, restart, and your framework handles span creation. No instrumentation code — just an environment variable.\n\n## One Standard, Every Backend\n\nThe strategic case for OTel over vendor-specific SDKs: every major observability platform now supports gen_ai.* attributes natively. [Datadog announced native support](https://www.datadoghq.com/blog/llm-otel-semantic-convention/) for OTel GenAI semantic conventions. Honeycomb, New Relic, Grafana, and Dynatrace all support them. Instrument once against the standard, route your telemetry to any backend, and switch backends without touching instrumentation code. Vendor-specific LLM observability SDKs do not offer this. 
You instrument with their SDK, you are locked to their platform — a bet worth avoiding in a market moving this fast.\n\n## What Is Coming Next\n\nThe OTel GenAI SIG is actively expanding three areas. The [agent span semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/) are growing to cover multi-agent systems — tasks, agent teams, memory operations, and artifact tracking. Stable mcp.* attributes for MCP tool tracing are in progress; when those land, you will have end-to-end visibility from agent invocation through every MCP tool call. Standardized cost-tracking attributes and quality signals like time-to-first-token are also on the roadmap.\n\nThe overhead argument against instrumentation does not hold: OTel adds under 1ms per call, and LLM API latency runs 100ms to 30 seconds. Per [OpenTelemetry’s GenAI observability guide](https://opentelemetry.io/blog/2026/genai-observability/), the cost of not instrumenting — debugging agent failures by guesswork — is substantially higher. If you are shipping agents to production, instrument them first. 
Visibility before features.", "url": "https://wpnews.pro/news/opentelemetry-genai-trace-llm-calls-and-agents-in-production", "canonical_source": "https://byteiota.com/opentelemetry-genai-trace-llm-calls-and-agents-in-production/", "published_at": "2026-06-19 11:13:43+00:00", "updated_at": "2026-06-19 11:42:51.539924+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "ai-tools", "developer-tools"], "entities": ["OpenTelemetry", "LangChain", "CrewAI", "Anthropic", "Claude"], "alternates": {"html": "https://wpnews.pro/news/opentelemetry-genai-trace-llm-calls-and-agents-in-production", "markdown": "https://wpnews.pro/news/opentelemetry-genai-trace-llm-calls-and-agents-in-production.md", "text": "https://wpnews.pro/news/opentelemetry-genai-trace-llm-calls-and-agents-in-production.txt", "jsonld": "https://wpnews.pro/news/opentelemetry-genai-trace-llm-calls-and-agents-in-production.jsonld"}}