How to Monitor AI Agents in Production

OpenObserve has introduced a monitoring solution for AI agents in production that relies on distributed tracing rather than logs, because a single user request can trigger ten or more internal operations across LLM calls, tool invocations, and agent steps. The system uses OpenTelemetry's GenAI semantic conventions to standardize span attributes for these operations, with auto-instrumentation libraries like OpenLLMetry, OpenInference, and OpenLIT requiring only two to three lines of initialization code. Traces are shipped to OpenObserve over OTLP, enabling SQL-queryable trace data, token usage dashboards, cost attribution by agent and model, and alerting on latency and cost anomalies.

TLDR

- Monitoring AI agents in production requires distributed tracing: a single user request fans out into 10 or more internal operations, and logs alone cannot show you which step is slow, failing, or burning your token budget.
- OpenTelemetry's `gen_ai.*` semantic conventions give you standardized span attributes for LLM calls, tool invocations, and agent steps. Some are stable today; others are still experimental.
- Auto-instrumentation libraries (OpenLLMetry, OpenInference, OpenLIT) cover most agent frameworks with two to three lines of initialization code. You do not change your agent code.
- Traces ship to OpenObserve over OTLP. From there you get SQL-queryable trace data, token usage dashboards, cost attribution by agent and model, and alerting on latency and cost anomalies.
- OpenObserve also exposes an MCP server. You can query your live agent traces from a Claude or GPT session without opening a dashboard.

A single LLM call is straightforward to observe. One HTTP request, one response, one latency number. You can log the input and output and call it done. An agent is different.
When a user sends a message, the agent calls an LLM to decide what to do, invokes a tool, processes the result, calls the LLM again, possibly calls another tool, and eventually returns a response. That one user message becomes ten or more internal operations. Some of those operations call external APIs. Some retry. Some spawn sub-agents.

Without distributed tracing, you see none of this structure. You know the response took 8 seconds. You do not know whether the LLM took 7 of those seconds or whether a tool made three retries before timing out.

Four categories of problems appear in production agents that you cannot debug without traces:

Distributed tracing gives you a complete record of every operation, in order, with timing and attributes. That record is what makes these questions answerable.

OpenTelemetry's GenAI semantic conventions define a standard set of span attributes for AI workloads. The stable attributes you can build on today:

| Attribute | What it captures |
|---|---|
| `gen_ai.system` | LLM provider: `openai`, `anthropic`, `cohere` |
| `gen_ai.operation.name` | Operation type: `chat`, `embeddings`, `text_completion` |
| `gen_ai.request.model` | Model name: `gpt-4o`, `claude-3-5-sonnet-20241022` |
| `gen_ai.usage.input_tokens` | Tokens consumed by the prompt |
| `gen_ai.usage.output_tokens` | Tokens in the model response |
| `gen_ai.response.finish_reasons` | Why the model stopped: `stop`, `tool_calls`, `length` |

For agent-specific spans, the conventions extend to `gen_ai.agent.name`, `gen_ai.agent.description`, `gen_ai.tool.name`, and `gen_ai.tool.description`. These are still marked experimental as of early 2026 but are already implemented by the major instrumentation libraries and are stable enough to use in production.

For a full breakdown of what OpenTelemetry captures for LLM workloads, including how SRE teams use the three signal types together, see [OpenTelemetry for LLMs: Complete SRE Guide](https://openobserve.ai/blog/opentelemetry-for-llms/).

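
To make the attribute names above concrete, here is a minimal, standard-library-only sketch of the key/value pairs a single chat span might carry. `ChatCallRecord` and `genai_span_attributes` are hypothetical names invented for this illustration, and the token counts are made-up sample values; in a real setup the instrumentation library fills these in for you.

```python
from dataclasses import dataclass, field


@dataclass
class ChatCallRecord:
    """Hypothetical container for what is known after one LLM call."""
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    finish_reasons: list[str] = field(default_factory=list)
    operation: str = "chat"


def genai_span_attributes(call: ChatCallRecord) -> dict[str, object]:
    """Map one call onto the gen_ai.* attribute names listed above."""
    return {
        "gen_ai.system": call.provider,
        "gen_ai.operation.name": call.operation,
        "gen_ai.request.model": call.model,
        "gen_ai.usage.input_tokens": call.input_tokens,
        "gen_ai.usage.output_tokens": call.output_tokens,
        # Finish reasons are plural: one entry per returned choice.
        "gen_ai.response.finish_reasons": list(call.finish_reasons),
    }


# Sample values only; not taken from a real trace.
attrs = genai_span_attributes(
    ChatCallRecord(
        provider="openai",
        model="gpt-4o",
        input_tokens=812,
        output_tokens=164,
        finish_reasons=["tool_calls"],
    )
)
for key, value in attrs.items():
    print(f"{key} = {value}")
```

With the OpenTelemetry Python API, a dictionary shaped like this could be passed to `span.set_attributes(attrs)` on a manually created span. Keeping token counts as integers rather than strings is what lets a backend sum them for cost attribution.
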
Every significant operation in an agent's lifecycle becomes a span:

- `gen_ai.chat`: wraps a single LLM API call. Carries model name, token counts, and finish reason.
- `gen_ai.tool`: wraps a single tool invocation. Child of the LLM call span that requested it.
- `agent.step`: wraps one full reasoning cycle. Parent of all LLM and tool spans within that cycle.

Prompt and completion content is large. Storing it as span attributes inflates trace payloads and storage costs. The OTel GenAI convention puts prompt and completion content into span events typed `gen_ai.content.prompt` and `gen_ai.content.completion` rather than attributes. Events attach to the span but are stored separately, keeping the attribute payload small while preserving full content for debugging.

In practice: leave content capture enabled during development. Before shipping to production, disable it at the application level or route it through the Collector for redaction.

When an orchestrator delegates to a worker agent, the worker's spans need to appear under the same root trace. For HTTP-based delegation, include the W3C `traceparent` header in the outgoing request and extract it in the worker. For in-process delegation (LangGraph node transitions, OpenAI Agents SDK handoffs), auto-instrumentation handles this automatically.

Three libraries sit between your agent code and the OTel SDK. The examples in this blog use LangChain and the OpenAI Agents SDK, both supported by all three libraries. For support across other frameworks (CrewAI, AutoGen, DSPy, and more), check each library's docs.

| Library | Signals | LangChain | OpenAI Agents | Config overhead |
|---|---|---|---|---|
| OpenLLMetry (`traceloop-sdk`) | Traces + Metrics + Logs | Yes | Yes | Medium |
| OpenInference | Traces only | Yes | Yes | Low |
| OpenLIT | Traces + Metrics | Yes | Yes | Minimal |

OpenLLMetry captures the most signals and covers the widest framework catalog. OpenLIT is the easiest entry point: one import, one function call.
OpenInference is traces-only but has the closest alignment with OTel GenAI semantic conventions.

For teams starting out: use OpenLLMetry. For teams already running an OTel SDK setup: use the official `opentelemetry-instrumentation-*` packages from `opentelemetry-python-contrib`, which include `opentelemetry-instrumentation-langchain` and `opentelemetry-instrumentation-openai-agents-v2`.

For a full walkthrough of OpenLIT with OpenObserve, including pre-built dashboards for GPU and vector database monitoring, see [LLM Observability for AI Applications with OpenObserve and OpenLIT](https://openobserve.ai/blog/observability-for-ai-applications-using-openobserve-and-openlit/). For a broader comparison of open-source LLM observability tooling, see [Top Open Source LLM Observability Tools](https://openobserve.ai/blog/llm-observability-tools/).

The following examples use LangChain and the OpenAI Agents SDK. The instrumentation pattern is the same for virtually every other agent framework: install a library, initialize before importing framework classes, point the exporter at your backend.

LangChain's current recommended approach for building agents uses LangGraph as the execution runtime. The `opentelemetry-instrumentation-langchain` package instruments both.

Install:

    pip install opentelemetry-sdk \
      opentelemetry-exporter-otlp-proto-http \
      opentelemetry-instrumentation-openai \
      langgraph langchain-openai

Initialize before any LangChain imports:

    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.openai import OpenAIInstrumentor

    exporter = OTLPSpanExporter(endpoint="