Observing LLM Applications with OpenTelemetry

OpenTelemetry, an open-source observability framework, is being adopted to monitor non-deterministic outputs and performance issues in LLM-based applications. The technology addresses challenges like hallucinations, inconsistent responses, and provider-side latency spikes that arise when integrating large language models into production systems. Developers can use OpenTelemetry's standardized instrumentation to collect telemetry data without vendor lock-in, enabling backend-agnostic monitoring of AI features.

Ever since OpenAI launched ChatGPT in November 2022, AI usage has exploded worldwide. Integrating LLMs into applications began soon after, rapidly going from an experimental, nice-to-have feature to a competitive, baseline requirement. And while you can find an AI implementation in almost every product today, shipping production-ready LLM features introduces its own set of challenges that developers must contend with.

In this article, we’ll dive into why observing LLM-based applications is now a critical requirement, what OpenTelemetry is, and how to integrate it into your applications with a practical demo. During this process, we will also look at the current maturity level of LLM-specific OpenTelemetry libraries, the GenAI Semantic Conventions, and some practical challenges you can face while instrumenting your LLM applications.

Why do LLM applications need observability?

If you are already familiar with the challenges of maintaining LLM applications across their lifecycle, feel free to skip to the next section, "What is OpenTelemetry?".

Handling non-determinism

Now you might think that observing your LLM integrations is not that different from classic observability. The key difference is that the output generated by LLMs is non-deterministic: the same input can produce completely different outputs across runs. Developers often equip LLMs with dedicated tools since models can hallucinate unpredictably on tasks that require precise, deterministic output.

Ensuring context-appropriate responses

Non-determinism does not mean that the responses are actually incorrect. In most scenarios though, developers likely want their responses to be structured in a certain way.
For example, while the response "very likely" might suit a query like "chances of rain tomorrow", the same response to a query like "chances of stock market climbing tomorrow" might be unacceptable, since the user likely expects more nuance from the application. Ensuring that responses remain consistent across a range of user queries is one of the key factors that separates a polished LLM product from an unreliable one.

Managing quality across updates

LLM providers frequently release model updates, modify their backends, and publish usage guides. Meanwhile, developers also experiment with model configurations and share the ones that work for them. All in all, the space is developing quickly, and each of these factors can affect the response quality of your LLM setup.

As a practical example, LLM providers can suffer "brown-outs" where their infrastructure cannot keep up with user demand, leading to latency spikes, timeouts, or even degraded response quality in certain scenarios. This makes it critical to observe how your LLM setup holds up over time.

What is OpenTelemetry?

OpenTelemetry (https://signoz.io/opentelemetry/), or OTel, is a Cloud Native Computing Foundation (CNCF) project aimed at standardizing the way we instrument applications for generating telemetry data. Before OpenTelemetry arrived, telemetry data lived in silos and often had little or no correlation between signals.

OpenTelemetry follows a specification-driven development model (https://github.com/open-telemetry/opentelemetry-specification?tab=readme-ov-file) that standardizes telemetry generation and collection details, meaning any compatible backend can process and visualize telemetry data emitted via its SDKs. As there is no need to rewrite the entire instrumentation plumbing each time you change observability backends, there is no vendor lock-in.

Implementing OpenTelemetry in LLM Applications

Prerequisites

- Python 3.12 or newer.
  Download the latest version (https://www.python.org/downloads/).
- A SigNoz Cloud account (https://signoz.io/teams/) for visualizing the telemetry data.
- An OpenAI API key (https://platform.openai.com/api-keys) to use with the application.
- An API client like Postman (https://www.postman.com/) or Bruno (https://www.usebruno.com/) for managing API payloads and visualizing responses.

While earlier Python versions like 3.10 may technically work, they are nearing their end of life (https://devguide.python.org/versions/#supported-versions). Python 3.12 will continue to receive security updates till late 2028.

Setting up SigNoz

SigNoz is an OpenTelemetry-native observability platform that provides logs, traces, and metrics in a unified platform.

- Sign up (https://signoz.io/teams/) for a free SigNoz Cloud account.
- Follow the documentation (https://signoz.io/docs/ingestion/signoz-cloud/keys/) to create ingestion keys for your account.
- Ensure the region and ingestion key values are readily accessible for the following steps.

Once done, you’re ready to configure the application and point it towards your SigNoz instance.

Running the Demo Application

Application Setup

Clone the SigNoz Examples repository and navigate to the application folder:

    git clone https://github.com/SigNoz/examples.git
    cd examples/python/opentelemetry-llm-demo

Create and activate a Python virtual environment:

    python3.12 -m venv .venv
    source .venv/bin/activate

The requirements.txt file contains all the necessary OpenTelemetry Python (https://signoz.io/docs/instrumentation/opentelemetry-python/) packages. Install them by running:

    python -m pip install -r requirements.txt

The following dependencies enable the OpenTelemetry instrumentation process:

- opentelemetry-distro: This provides a convenient mechanism to automatically configure some of the more common options for users, helping us get started with OpenTelemetry auto-instrumentation quickly.
- opentelemetry-exporter-otlp: This package installs the OTLP (https://signoz.io/blog/what-is-otlp/) exporters required to transmit telemetry data to any OpenTelemetry backend (https://signoz.io/blog/opentelemetry-backend/).

The following command detects standard libraries or frameworks used in our application, such as FastAPI, and installs their respective instrumentation libraries:

    opentelemetry-bootstrap --action=install

Finally, we will configure our environment variables and start the application, wrapping the entrypoint within opentelemetry-instrument to auto-instrument our application code.

    OPENAI_API_KEY="<your-openai-api-key>" \
    OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.<your-region>.signoz.cloud:443" \
    OTEL_EXPORTER_OTLP_HEADERS="signoz-ingestion-key=<your-ingestion-key>" \
    OTEL_SERVICE_NAME="opentelemetry-llm-demo" \
    OTEL_RESOURCE_ATTRIBUTES="service.version=0.1.0,deployment.environment=dev" \
    OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" \
    OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true \
    opentelemetry-instrument fastapi run --port 8085 --workers 1

Replace the <your-region> and <your-ingestion-key> placeholders with the region of your SigNoz workspace (e.g., us, in) and your newly created ingestion key. You will also need to supply your OpenAI API key.

OTEL_RESOURCE_ATTRIBUTES defines the metadata attached to each batch of telemetry that goes out of our application, and the service name opentelemetry-llm-demo ensures the OTel backend can correctly identify the telemetry source. Setting OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED to true ensures that application logs are exported alongside and correlated with the generated traces.

Run the command, and the FastAPI server should start on port 8085. Calling the / endpoint will return a simple success response.
    $ curl "http://127.0.0.1:8085/"
    {"message":"OpenTelemetry NBA agent demo is running"}

Before we move ahead, let’s look at how the OpenTelemetry community is standardizing telemetry conventions, capture, and export processes for AI-driven applications.

Evolving GenAI Standards

Throughout this article, we’ve used the term "LLMs", but as you become familiar with OpenTelemetry, you will see the term Generative AI, or GenAI, being used almost exclusively. OpenTelemetry uses the umbrella term GenAI to refer to any application utilizing AI models. Later, when we use our application, you’ll see that all telemetry attributes and metrics ingested by SigNoz have the gen_ai. prefix, such as gen_ai.agent.name (https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/#gen-ai-agent-name).

It is crucial to understand that these standards are evolving rapidly and are still technically in Development status (https://opentelemetry.io/docs/specs/semconv/gen-ai/). Because of this fast-paced development, and the criticality of the GenAI project, the OpenTelemetry team recently created a separate, dedicated repository (https://github.com/open-telemetry/opentelemetry-python-genai) to maintain GenAI instrumentation libraries.

This "immaturity" has a real-world impact on developers. For example, the opentelemetry-bootstrap tool currently doesn’t recognize the OpenAI Agents SDK that we’ll be using in our application, so it does not install the matching instrumentation library. To counter this, we have manually added the corresponding entry to the requirements.txt file (https://github.com/SigNoz/examples/blob/main/python/opentelemetry-llm-demo/requirements.txt#L6).

Further, you will often find that the current GenAI instrumentation libraries struggle to maintain full coverage for these rapidly changing AI implementations. We will explore this in more detail when we analyze our app’s trace output.
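To make the gen_ai. naming scheme more concrete, here is a minimal sketch of the attribute set a chat-style span might carry. It uses only the standard library and does not call the OpenTelemetry SDK. The helper names build_chat_span and non_genai_keys are hypothetical, and the attribute keys and the "operation model" span-name pattern reflect the GenAI semantic conventions as we understand them today. Since those conventions are still in Development status, treat the exact keys as assumptions that may change.

```python
from typing import Any, Dict, List, Optional, Tuple

GEN_AI_PREFIX = "gen_ai."


def build_chat_span(
    model: str,
    input_tokens: int,
    output_tokens: int,
    agent_name: Optional[str] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return (span_name, attributes) shaped like a GenAI chat span.

    Hypothetical helper: the keys follow the evolving GenAI semantic
    conventions and are assumptions, not a stable contract.
    """
    attributes: Dict[str, Any] = {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    if agent_name is not None:
        attributes["gen_ai.agent.name"] = agent_name
    # Assumed naming pattern: "<operation name> <request model>".
    span_name = f"chat {model}"
    return span_name, attributes


def non_genai_keys(attributes: Dict[str, Any]) -> List[str]:
    """List attribute keys that do not use the gen_ai. prefix."""
    return sorted(k for k in attributes if not k.startswith(GEN_AI_PREFIX))


if __name__ == "__main__":
    name, attrs = build_chat_span("example-model", 120, 45, agent_name="NBA Reporter")
    print(name)
    for key in sorted(attrs):
        print(f"  {key} = {attrs[key]}")
```

A span named after the model, like the one you will see in the trace view later, would follow from this kind of pattern. A filter such as non_genai_keys is one simple way to separate GenAI attributes from HTTP or framework attributes when inspecting exported spans.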
Dissecting the OpenTelemetry LLM Application

Now let's go over the key implementation details to fully understand how the agentic workflow has been wired up. Skip to the section "Visualizing Agentic Workflows with OpenTelemetry" if you want to jump directly to the interactive part.

Dependency Choices

Besides the essential OpenTelemetry dependencies we’ve already discussed above, the application features:

- FastAPI as the web framework, providing strong request-response data validation guarantees, automatic OpenAPI schema generation, and first-class async support, making it an excellent choice for building performant web APIs to serve AI workloads.
- OpenAI Agents SDK for building our agentic workflow. It ships with all the bells and whistles you expect from an AI framework, such as sandbox mode, a human-in-the-loop mechanism, etc. OpenAI has published extensive documentation (https://openai.github.io/openai-agents-python/), making it easy to get started, and the project has amassed about 27,000 GitHub stars (https://github.com/openai/openai-agents-python) at the time of writing.
- OpenAI Agents SDK Instrumentation, which the opentelemetry-bootstrap command does not install, was added manually to ensure our workflow generates telemetry upon execution.

The OpenAI Python SDK (https://github.com/openai/openai-python) is a completely fine choice for building production-ready LLM applications that don’t need the complete set of agentic orchestration capabilities. Note that the corresponding OTel instrumentation library currently lacks meaningful support for the newer Responses API, meaning you must rely on the Chat Completions API. This is a real handicap, since OpenAI recommends the Responses API for most projects due to its better API design, built-in agentic support, and improved cache utilization that cuts down on usage costs.
The OpenTelemetry GenAI team is hard at work, however: a PR for instrumenting the create method of the Responses API was merged in April (https://github.com/open-telemetry/opentelemetry-python-contrib/pull/4474). We should expect more targeted updates in the near future.

The Core Agentic Setup

Our web application exposes an NBA Reporter agent that reports the latest NBA news given a topic (e.g., general or finals), and performs on-demand analysis. When the user asks follow-up questions, our agent utilizes server sessions to analyze conversation history and provide context-aware output.

Defining the Agentic Workflow

The agent_service.py file (https://github.com/SigNoz/examples/blob/main/python/opentelemetry-llm-demo/app/agent_service.py) defines the agent configuration and the executor function used in our FastAPI endpoint to serve user queries.

We initialize the agent with the system prompt, tools, input guardrails, and the model name. Since our agent is not expected to perform complex operations, we use the latest available version of GPT-5.4 Mini to minimize our usage costs without sacrificing the output quality.

    from agents import Agent, WebSearchTool, function_tool

    # The prompt, model name, tool, and guardrail are defined elsewhere in the module.
    NBA_AGENT = Agent(
        name="NBA Reporter",
        instructions=NBA_INTERACTIVE_PROMPT,
        tools=[WebSearchTool(), calculate_win_percentage],
        input_guardrails=[nba_content_guardrail],
        model=OPENAI_MODEL,
    )

The calculate_win_percentage tool helps the agent calculate and display a team’s win percentage accurately and consistently.

    @function_tool
    def calculate_win_percentage(wins: int, losses: int) -> str:
        """Calculates the winning percentage for an NBA team given their wins and losses."""
        total_games = wins + losses
        if total_games == 0:
            return ".000"
        # Format to three decimals and drop the leading zero, e.g. "0.625" -> ".625".
        return f"{wins / total_games:.3f}".lstrip("0")

In our case, we use an input guardrail to limit discussions to basketball topics. On detecting a user query that does not contain any of the pre-defined keywords, our function returns a GuardrailFunctionOutput with the tripwire_triggered parameter set to True.
    @input_guardrail
    def nba_content_guardrail(
        context: RunContextWrapper[None],
        agent: Agent,
        input_data: str | list,
    ) -> GuardrailFunctionOutput:
        """Ensures the user query is relevant to basketball/NBA."""
        keywords = ["nba", "basketball", "player"]  # ... removed for brevity

        # extract the last user message from chat history
        if isinstance(input_data, list):
            latest_user_message = next(
                (
                    item.get("content", "")
                    for item in reversed(input_data)
                    if isinstance(item, dict) and item.get("role") == "user"
                ),
                "",
            )
            input_query = latest_user_message
        else:
            input_query = input_data

        input_query = input_query.lower()
        is_relevant = any(keyword in input_query for keyword in keywords)

        # very short inputs and on-topic queries pass through
        if len(input_query) < 5 or is_relevant:
            return GuardrailFunctionOutput(tripwire_triggered=False, output_info=None)

        return GuardrailFunctionOutput(
            tripwire_triggered=True,
            output_info={
                "reason": "The request is off-topic. Please ask questions relevant to NBA or basketball."
            },
        )

Setting tripwire_triggered=True signals the Agent to raise the InputGuardrailTripwireTriggered exception and stop the agent execution loop. We intercept these exceptions through the FastAPI exception handlers to record the exception event and return an appropriate response.

    @app.exception_handler(InputGuardrailTripwireTriggered)
    async def handle_guardrail_block(
        request: Request,
        exc: InputGuardrailTripwireTriggered,
    ) -> JSONResponse:
        guardrail_msg = exc.guardrail_result.output.output_info
        # attach the exception as an event on the active span
        span = trace.get_current_span()
        span.record_exception(exc)
        return JSONResponse(status_code=400, content={"detail": guardrail_msg})

Running the Agent and Managing Session Context

The run_agent_turn function completes our agentic implementation. The OpenAIConversationsSession object (https://openai.github.io/openai-agents-python/sessions/) signals the workflow to leverage OpenAI-managed server sessions to maintain conversation context across turns.
Each API response includes a session_id that uniquely identifies the session context: user and agent responses, tool call metadata, etc. If the user includes this session_id in a follow-up request, the agent automatically retrieves the conversation context before processing the latest query. Otherwise, it creates a new session to store the context for the current query.

    def run_agent_turn(
        topic: str,
        user_message: str | None,
        session_id: str | None,
    ) -> dict:
        nba_topic = validate_topic(topic)
        prompt = build_nba_turn_prompt(nba_topic, user_message)

        # if no session ID was given, the sdk internally creates a session id during the turn
        # subsequent calls which pass the ID maintain the conversation context
        session = OpenAIConversationsSession(conversation_id=session_id)
        result = Runner.run_sync(NBA_AGENT, prompt, session=session)
        message = sanitize_agent_message(result.final_output or "").strip()

        # ... skipped for brevity

        return {
            "topic": topic,
            "session_id": session.session_id,
            "message": message,
            "model": OPENAI_MODEL,
            "usage": usage,
        }

Visualizing Agentic Workflows with OpenTelemetry

Interacting with the Agent

Let’s start by asking the NBA Reporter agent a specific question about the upcoming NBA finals.

The response contains the agent’s analysis of KAT’s recent performances for the New York Knicks, the token usage stats, and a unique session_id.

Let’s attach the session_id to our follow-up question about this player. We will not explicitly refer to the player by name to see if the agent can access the existing context.

Great, the agent correctly identifies KAT from our previous query and returns the same session_id, confirming that it re-used the active session instead of creating a new one.

Now, let’s see what happens if we ask it to answer an off-topic question, such as the weather in Barcelona.

Since our agent is equipped with a web search tool, any user could potentially prompt the agent to run costly, time-intensive searches, or access malicious resources.
As an LLM application developer, it falls on you to implement strict guardrails to prevent data leaks and exploits.

Exploring the Steps within the Agent Workflow

Now, let’s see what the trace looks like for the successful request with the session_id in the payload, and compare it to the one that failed the guardrail check.

Expanding the trace execution for the follow-up request, we can see that the workflow makes multiple API calls. Clicking on the initial GET and POST spans reveals the agent fetching and most likely saving conversation data, respectively.

Within the invoke_agent span, we can see that the guardrail check has been documented as well, capturing the guardrail function name and whether the guardrail was triggered.

The span with the model name stores the conversation history and the last model output in the gen_ai.input.messages and gen_ai.output.messages span attributes. While helpful for debugging, these span attributes can be incredibly verbose and may contain sensitive PII. To ensure user input and output content capture is explicitly disabled, you can use the following environment variable:

    export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=0

Opening the detailed view of our failed trace, we can see that the root span links to the exception event we captured in our FastAPI exception handler. The gen_ai.guardrail.triggered attribute on the guardrail check span is set to true, clearly indicating that the guardrail in question blocked further execution.

You may have noticed that multiple spans in the trace tree are simply titled unknown. These represent certain internal API paths that haven’t been fully mapped out as part of the current instrumentation process. However, you can still deduce what is happening within these spans by referring to their child spans, which are correctly labelled (like the guardrail check span).
There is an open PR (https://github.com/open-telemetry/opentelemetry-python-genai/issues/86) addressing many such conformance issues for the Agents SDK, so we can expect significant improvements in the near future.

Monitoring LLM Usage and Agent Performance

While we focused primarily on traces to map out our agent's lifecycle, the instrumentation wrapper simultaneously exposes core metrics like gen_ai.client.operation.duration and gen_ai.client.token.usage out of the box, useful for building dashboards on token spend and call volumes over time.

You can also import the SigNoz dashboard template (https://signoz.io/docs/dashboards/dashboard-templates/openai-dashboard/) for the OpenAI Python SDK. It is highly compatible with our LLM application, except for the cache utilization metrics, which aren’t yet exported by the current Agents SDK instrumentation path.

Wrapping Up

With this guide, you now have the fundamental knowledge required to begin instrumenting your LLM-based applications with OpenTelemetry. We began by exploring the unique observability challenges introduced by non-deterministic models and complex agentic workflows. From there, we wired up a FastAPI application using the OpenAI Agents SDK, navigated the developing GenAI semantic conventions, and visualized how traces reveal exactly what happens under the hood.

SigNoz (https://signoz.io/) is an OpenTelemetry-native platform that visualizes traces, metrics, and logs in a single pane, making it incredibly easy to debug complex agentic loops and monitor token spend across your AI deployments. If you’re interested in trying out SigNoz for your LLM applications, sign up (https://signoz.io/teams/) for a 30-day free trial (no credit card required).