Ever since OpenAI launched ChatGPT in November 2022, AI usage has exploded worldwide. Integrating LLMs into applications began soon after, rapidly going from an experimental, nice-to-have feature to a competitive, baseline requirement.
And while you can find an AI implementation in almost every product today, shipping production-ready LLM features introduces its own set of challenges that developers must contend with.
In this article, we’ll dive into why observing LLM-based applications is now a critical requirement, what OpenTelemetry is, and how to integrate it into your applications with a practical demo.
During this process, we will also look at the current maturity level of LLM-specific OpenTelemetry libraries, the GenAI Semantic Conventions, and some practical challenges you can face while instrumenting your LLM applications.
Why do LLM applications need observability?
If you are already familiar with the challenges of maintaining LLM applications across their lifecycle, feel free to skip to the next section that discusses OpenTelemetry.
Handling non-determinism
Now you might think that observing your LLM integrations is not that different from classic observability. The key difference is that the output generated by LLMs is non-deterministic: the same input can produce completely different outputs across runs.
Developers often equip LLMs with dedicated tools since models can hallucinate unpredictably on tasks that require precise, deterministic output.
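To make the non-determinism concrete, here is a toy sketch of temperature-based sampling over a made-up next-token distribution. The token list, the scores, and the sample_token helper are all hypothetical and are not taken from any real model; the sketch only illustrates why the same input can lead to different outputs, and why a greedy (temperature 0) choice does not.

```python
import math
import random

# Hypothetical next-token scores (logits) for a single, fixed prompt.
LOGITS = {"likely": 2.0, "unlikely": 1.0, "uncertain": 0.5}


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token; a temperature of 0 falls back to the greedy choice."""
    if temperature <= 0:
        return max(logits, key=logits.get)
    # Softmax with temperature: a higher temperature flattens the distribution.
    scaled = {token: score / temperature for token, score in logits.items()}
    peak = max(scaled.values())
    weights = [math.exp(score - peak) for score in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


rng = random.Random()
greedy = {sample_token(LOGITS, 0.0, rng) for _ in range(20)}
sampled = {sample_token(LOGITS, 1.0, rng) for _ in range(200)}
print(greedy)   # the greedy choice never changes
print(sampled)  # sampling usually surfaces several different tokens
```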
Ensuring context-appropriate responses
Non-determinism does not mean that the responses are actually incorrect. In most scenarios though, developers likely want their responses to be structured in a certain way.
For example, the response "very likely" might suit a query like "chances of rain tomorrow", but the same response to a query like "chances of stock market climbing tomorrow" might be unacceptable, since the user likely expects more nuance from the application.
Ensuring that responses remain consistent across a range of user queries is one of the key factors that separates a polished LLM product from an unreliable one.
Managing quality across updates
LLM providers frequently release model updates, modify their backends, and provide optimal usage guides. Meanwhile, developers also experiment with model configurations and share the ones which work for them. All in all, the space is developing quickly, and each of these factors can affect the response quality of your LLM setup.
As a practical example, LLM providers can suffer "brown-outs" where their infrastructure cannot keep up with user demand, leading to latency spikes, timeouts, or even degraded response quality in certain scenarios, making it critical to observe how your LLM setup holds up over time.
What is OpenTelemetry?
OpenTelemetry (OTel) is a Cloud Native Computing Foundation (CNCF) project aimed at standardizing the way we instrument applications for generating telemetry data. Before OpenTelemetry arrived, telemetry data lived in silos and often had little or no correlation between signals.
It follows a specification-driven development model that standardizes telemetry generation and collection details, meaning any compatible backend can process and visualize telemetry data emitted via its SDKs.
As there is no need to rewrite the entire instrumentation plumbing each time you change observability backends, there is no vendor lock-in.
Implementing OpenTelemetry in LLM Applications
Prerequisites
- Python 3.12 or newer. Download the latest version.
- A SigNoz Cloud account for visualizing the telemetry data.
- An OpenAI API key to use with the application.
- An API client like Postman or Bruno for managing API payloads and visualizing responses.
While earlier Python versions like 3.10 may technically work, they are nearing their end of life. Python 3.12 will continue to receive security updates till late 2028.
Setting up SigNoz
SigNoz is an OpenTelemetry-native observability platform that provides logs, traces, and metrics in a unified platform.
- Sign up for a free SigNoz Cloud account.
- Follow the documentation to create ingestion keys for your account.
- Ensure the region and ingestion key values are readily accessible for the following steps.
Once done, you’re ready to configure the application and point it towards your SigNoz instance.
Running the Demo Application
Application Setup
Clone the SigNoz Examples repository and navigate to the application folder:
git clone https://github.com/SigNoz/examples.git
cd examples/python/opentelemetry-llm-demo
Create and activate a Python virtual environment.
python3.12 -m venv .venv
source .venv/bin/activate
The requirements.txt file contains all the necessary OpenTelemetry Python packages. Install them by running:
python -m pip install -r requirements.txt
The following dependencies enable the OpenTelemetry instrumentation process:
- opentelemetry-distro: This provides a convenient mechanism to automatically configure some of the more common options for users, helping us get started with OpenTelemetry auto-instrumentation quickly.
- opentelemetry-exporter-otlp: This package installs the OTLP exporters required to transmit telemetry data to any OpenTelemetry backend.
The following command detects standard libraries or frameworks (such as FastAPI) used in our application, and installs their respective instrumentation libraries:
opentelemetry-bootstrap --action=install
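As a rough mental model of that detection step, the sketch below checks the installed distributions against a small lookup table. It is a conceptual illustration only, not the tool's actual implementation: the KNOWN_INSTRUMENTATIONS table is an assumed, shortened mapping, and both helper functions are hypothetical names.

```python
from importlib import metadata

# Assumed, illustrative mapping from a library to its instrumentation package.
# The real tool maintains its own, much larger list.
KNOWN_INSTRUMENTATIONS = {
    "fastapi": "opentelemetry-instrumentation-fastapi",
    "requests": "opentelemetry-instrumentation-requests",
    "httpx": "opentelemetry-instrumentation-httpx",
}


def installed_distributions() -> set[str]:
    """Return the lowercase names of all distributions in this environment."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            names.add(name.lower())
    return names


def suggest_instrumentations(installed: set[str]) -> list[str]:
    """List instrumentation packages for the known libraries that are installed."""
    return sorted(
        package
        for library, package in KNOWN_INSTRUMENTATIONS.items()
        if library in installed
    )


print(suggest_instrumentations(installed_distributions()))
```

In this model, a library that is missing from the lookup table is silently skipped, so its instrumentation has to be installed by hand.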
Finally, we will configure our environment variables and start the application, wrapping the entrypoint within opentelemetry-instrument to auto-instrument our application code.
OPENAI_API_KEY="<your-openai-api-key>" \
OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.<your-region>.signoz.cloud:443" \
OTEL_EXPORTER_OTLP_HEADERS="signoz-ingestion-key=<your-ingestion-key>" \
OTEL_SERVICE_NAME="opentelemetry-llm-demo" \
OTEL_RESOURCE_ATTRIBUTES="service.version=0.1.0,deployment.environment=dev" \
OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" \
OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true \
opentelemetry-instrument fastapi run --port 8085 --workers 1
Replace the <your-region> and <your-ingestion-key> placeholders with the region of your SigNoz workspace (e.g., us, in) and your newly created ingestion key. You will also need to supply your OpenAI API key.
OTEL_RESOURCE_ATTRIBUTES defines the metadata attached to each batch of telemetry that goes out of our application, and the service name opentelemetry-llm-demo ensures the OTel backend can correctly identify the telemetry source.
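The value of OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs. The following sketch shows how such a string maps to individual attributes. The parse_resource_attributes helper is hypothetical and deliberately simplified (it ignores details such as value encoding); the SDK performs its own parsing.

```python
def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Split a comma-separated list of key=value pairs into a dictionary."""
    attributes = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, _, value = pair.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes


raw = "service.version=0.1.0,deployment.environment=dev"
print(parse_resource_attributes(raw))
# {'service.version': '0.1.0', 'deployment.environment': 'dev'}
```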
Setting OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED to true ensures that application logs are exported alongside and correlated with the generated traces.
Run the command, and the FastAPI server should start on port 8085. Calling the / endpoint will return a simple success response.
(.venv) ❯ curl "http://127.0.0.1:8085/"
{"message":"OpenTelemetry NBA agent demo is running"}
Before we move ahead, let’s look at how the OpenTelemetry community is standardizing telemetry conventions, capture, and export processes for AI-driven applications.
Evolving GenAI Standards
Throughout this article, we’ve used the term "LLMs", but as you become familiar with OpenTelemetry, you will see the term Generative AI (or GenAI) being used almost exclusively. OpenTelemetry uses the umbrella term GenAI to refer to any application utilizing AI models.
Later, when we use our application, you’ll see that all telemetry attributes and metrics ingested by SigNoz have the gen_ai.* prefix, such as gen_ai.agent.name.
It is crucial to understand that these standards are evolving rapidly and are still technically in Development status. Because of this fast-paced development, and the criticality of the GenAI project, the OpenTelemetry team recently created a separate, dedicated repository to maintain GenAI instrumentation libraries.
This "immaturity" has a real-world impact on developers. For example, the opentelemetry-bootstrap tool currently doesn’t recognize or install the OpenAI Agents SDK that we’ll be using in our application.
To counter this, we have manually added the corresponding entry to the requirements.txt file.
Further, you will often find that the current GenAI instrumentation libraries struggle to maintain full coverage for these rapidly changing AI implementations. We will explore this in more detail when we analyze our app’s trace output.
Dissecting the OpenTelemetry LLM Application
Now let's go over the key implementation details to fully understand how the agentic workflow has been wired up. Skip to the next section if you want to jump directly to the interactive part.
Dependency Choices
Besides the essential OpenTelemetry dependencies we’ve already discussed above, the application features:
- FastAPI as the web framework, providing strong request-response data validation guarantees, automatic OpenAPI schema generation, and first-class async support, making it an excellent choice for building performant web APIs to serve AI workloads.
- OpenAI Agents SDK for building our agentic workflow. It ships with all the bells and whistles you expect from an AI framework, such as sandbox mode, human in the loop mechanism, etc. OpenAI has published extensive documentation making it easy to get started, and the project has amassed about 27,000 GitHub stars at the time of writing.
- OpenAI Agents SDK Instrumentation, which is not included via the opentelemetry-bootstrap command and was added manually to ensure our workflow generates telemetry upon execution.
The OpenAI Python SDK is a completely fine choice for building production-ready LLM applications that don’t need the complete set of agentic orchestration capabilities.
Note that the corresponding OTel instrumentation library currently lacks meaningful support for the newer Responses API, meaning you must rely on the Chat Completions API. This is a real handicap, since OpenAI recommends the Responses API for most projects due to its better API design, built-in agentic support, and improved cache utilization that cuts down on usage costs.
The OpenTelemetry GenAI team is hard at work, however: a PR for instrumenting the create method of the Responses API was merged in April. We should expect more targeted updates in the near future.
The Core Agentic Setup
Our web application exposes an NBA Reporter agent that reports the latest NBA news given a topic (e.g., general, or finals), and performs on-demand analysis. When the user asks follow-up questions, our agent utilizes server sessions to analyze conversation history and provide context-aware output.
Defining the Agentic Workflow
The agent_service.py file defines the agent configuration and the executor function used in our FastAPI endpoint to serve user queries.
We initialize the agent with the system prompt, tools, input guardrails, and the model name. Since our agent is not expected to perform complex operations, we use the latest available version of GPT-5.4 Mini to minimize our usage costs without sacrificing the output quality.
NBA_AGENT = Agent(
    name="NBA_Reporter",
    instructions=NBA_INTERACTIVE_PROMPT,
    tools=[WebSearchTool(), calculate_win_percentage],
    input_guardrails=[nba_content_guardrail],
    model=OPENAI_MODEL,
)
The calculate_win_percentage tool lets the agent calculate a team’s win percentage accurately and display it in a consistent format.
@function_tool
def calculate_win_percentage(wins: int, losses: int) -> str:
    """Calculates the winning percentage for an NBA team given their wins and losses."""
    total_games = wins + losses
    if total_games == 0:
        return ".000"
    return f"{wins / total_games:.3f}".lstrip("0")
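To see what the tool returns for a few records, the same formatting logic can be run as a plain function. The win_percentage name below is a hypothetical, undecorated copy used only for illustration; it does not go through the Agents SDK.

```python
def win_percentage(wins: int, losses: int) -> str:
    """Same formatting logic as the tool above, without the SDK decorator."""
    total_games = wins + losses
    if total_games == 0:
        return ".000"
    # Format to three decimals, then drop the leading zero (0.610 -> .610).
    return f"{wins / total_games:.3f}".lstrip("0")


for wins, losses in [(41, 41), (50, 32), (0, 0), (82, 0)]:
    print(wins, losses, win_percentage(wins, losses))
# 41 41 .500
# 50 32 .610
# 0 0 .000
# 82 0 1.000
```

Because the arithmetic happens in ordinary Python, the result is the same on every run, which is the point of delegating this step to a tool rather than to the model.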
In our case, we use an input guardrail to limit discussions to basketball topics. On detecting a user query that does not contain any of the pre-defined keywords, our function returns a GuardrailFunctionOutput with the tripwire_triggered parameter set to True.
@input_guardrail()
def nba_content_guardrail(
    context: RunContextWrapper[None],
    agent: Agent,
    input_data: str | list,
) -> GuardrailFunctionOutput:
    """Ensures the user query is relevant to basketball/NBA."""
    keywords = [
        "nba",
        "basketball",
        "player",
        ... # removed for brevity
    ]
    if isinstance(input_data, list):
        latest_user_message = next(
            (
                item.get("content", "")
                for item in reversed(input_data)
                if isinstance(item, dict) and item.get("role") == "user"
            ),
            "",
        )
        input_query = latest_user_message
    else:
        input_query = input_data
    input_query = input_query.lower()
    is_relevant = any(keyword in input_query for keyword in keywords)
    if len(input_query) < 5 or is_relevant:
        return GuardrailFunctionOutput(tripwire_triggered=False, output_info=None)
    return GuardrailFunctionOutput(
        tripwire_triggered=True,
        output_info={
            "reason": "The request is off-topic. Please ask questions relevant to NBA or basketball."
        },
    )
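The decision logic of the guardrail can be exercised in isolation with plain strings. The sketch below assumes a shortened keyword list and uses a hypothetical is_allowed helper; it mirrors only the keyword and length checks, not the SDK types or the message-history handling.

```python
KEYWORDS = ("nba", "basketball", "player")  # shortened, illustrative list


def is_allowed(query: str, keywords: tuple[str, ...] = KEYWORDS) -> bool:
    """Mirror the guardrail's decision: True means the query passes."""
    normalized = query.lower()
    is_relevant = any(keyword in normalized for keyword in keywords)
    # Very short inputs (for example "hi") pass even without a keyword.
    return len(normalized) < 5 or is_relevant


print(is_allowed("Who leads the NBA in scoring?"))      # True
print(is_allowed("What is the weather in Barcelona?"))  # False
print(is_allowed("hi"))                                 # True
print(is_allowed("Best multiplayer games this year"))   # True
```

The last call shows a limitation of plain substring matching: "multiplayer" contains "player", so an off-topic query slips through. Matching on whole words is one possible way to tighten the check.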
The tripwire_triggered=True parameter signals the Agent to raise the InputGuardrailTripwireTriggered exception and stop the agent execution loop. We intercept these exceptions through the FastAPI exception handlers to record the exception event and return an appropriate response.
@app.exception_handler(InputGuardrailTripwireTriggered)
async def handle_guardrail_block(
    request: Request,
    exc: InputGuardrailTripwireTriggered,
) -> JSONResponse:
    guardrail_msg = exc.guardrail_result.output.output_info
    span = trace.get_current_span()
    span.record_exception(exc)
    return JSONResponse(status_code=400, content={"detail": guardrail_msg})
Running the Agent and Managing Session Context
The run_agent_turn function completes our agentic implementation.
The OpenAIConversationsSession object signals the workflow to leverage OpenAI-managed server sessions to maintain conversation context across turns. Each API response includes a session_id that uniquely identifies the session context: user and agent responses, tool call metadata, etc. If the user includes this session_id in a follow-up request, the agent automatically retrieves the conversation context before processing the latest query; otherwise, a new session is created to store the context for the current query.
def run_agent_turn(
    topic: str,
    user_message: str | None,
    session_id: str | None,
) -> dict:
    nba_topic = _validate_topic(topic)
    prompt = build_nba_turn_prompt(nba_topic, user_message)
    session = OpenAIConversationsSession(conversation_id=session_id)
    result = Runner.run_sync(NBA_AGENT, prompt, session=session)
    message = sanitize_agent_message((result.final_output or "").strip())
    ... # skipped for brevity
    return {
        "topic": topic,
        "session_id": session.session_id,
        "message": message,
        "model": OPENAI_MODEL,
        "usage": usage,
    }
Visualizing Agentic Workflows with OpenTelemetry
Interacting with the Agent
Let’s start by asking the NBA Reporter agent a specific question about the upcoming NBA finals.
The response contains the agent’s analysis of KAT’s recent performances for the New York Knicks, the token usage stats, and a unique session_id.
Let’s attach the session_id to our follow-up question about this player. We will not explicitly refer to the player by name to see if the agent can access the existing context.
Great, the agent correctly identifies KAT from our previous query and returns the same session_id, confirming that it re-used the active session instead of creating a new one.
Now, let’s see what happens if we ask it to answer an off-topic question, such as the weather in Barcelona.
Since our agent is equipped with a web search tool, any user could potentially prompt the agent to run costly, time-intensive searches, or access malicious resources. As an LLM application developer, it falls on you to implement strict guardrails to prevent data leaks and exploits.
Exploring the Steps within the Agent Workflow
Now, let’s see what the trace looks like for the successful request with the session_id in the payload, and compare it to the one that failed the guardrail check.
Expanding the trace execution for the follow-up request, we can see that the workflow makes multiple API calls. Clicking on the initial GET and POST spans reveals the agent fetching and most likely saving conversation data, respectively.
Within the invoke_agent span, we can see that the guardrail check has been documented as well, capturing the guardrail function name and whether the guardrail was triggered.
The span with the model name stores the conversation history and the last model output in the gen_ai.input.messages and gen_ai.output.messages span attributes.
While helpful for debugging, these span attributes can be incredibly verbose and may contain sensitive PII. To ensure user input and output content capture is explicitly disabled, you can use the following environment variable:
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=0
Opening the detailed view of our failed trace, we can see that the root span links to the exception event we captured in our FastAPI exception handler. The gen_ai.guardrail.triggered attribute on the guardrail_check span is set to true, clearly indicating that the guardrail in question blocked further execution.
You may have noticed that multiple spans in the trace tree are simply titled unknown. These represent certain internal API paths that haven’t been fully mapped out as part of the current instrumentation process.
However, you can still deduce what is happening within these spans by referring to their child spans, which are correctly labelled (like the guardrail_check span).
There is an open PR addressing many such conformance issues for the Agents SDK, so we can expect significant improvements in the near future.
Monitoring LLM Usage and Agent Performance
While we focused primarily on traces to map out our agent's lifecycle, the instrumentation wrapper simultaneously exposes core metrics like gen_ai.client.operation.duration and gen_ai.client.token.usage out of the box, useful for building dashboards on token spend and call volumes over time.
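As a rough picture of what a token-spend panel computes, the sketch below sums token counts per model and token type. The data points are made up and flattened for illustration (the real gen_ai.client.token.usage metric is a histogram that the backend aggregates), and total_tokens_by_type is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical, flattened data points shaped like token usage measurements.
DATA_POINTS = [
    {"gen_ai.request.model": "example-model", "gen_ai.token.type": "input", "value": 1200},
    {"gen_ai.request.model": "example-model", "gen_ai.token.type": "output", "value": 340},
    {"gen_ai.request.model": "example-model", "gen_ai.token.type": "input", "value": 900},
]


def total_tokens_by_type(points: list[dict]) -> dict[tuple[str, str], int]:
    """Sum token counts per (model, token type) pair."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    for point in points:
        key = (point["gen_ai.request.model"], point["gen_ai.token.type"])
        totals[key] += point["value"]
    return dict(totals)


print(total_tokens_by_type(DATA_POINTS))
# {('example-model', 'input'): 2100, ('example-model', 'output'): 340}
```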
You can also import the SigNoz dashboard template for the OpenAI Python SDK. It is highly compatible with our LLM application, except for the cache utilization metrics, which aren’t yet exported by the current Agents SDK instrumentation path.
Wrapping Up
With this guide, you now have the fundamental knowledge required to begin instrumenting your LLM-based applications with OpenTelemetry.
We began by exploring the unique observability challenges introduced by non-deterministic models and complex agentic workflows. From there, we wired up a FastAPI application using the OpenAI Agents SDK, navigated the developing GenAI semantic conventions, and visualized how traces reveal exactly what happens under the hood.
SigNoz is an OpenTelemetry-native platform that visualizes traces, metrics, and logs in a single pane, making it incredibly easy to debug complex agentic loops and monitor token spend across your AI deployments.
If you’re interested in trying out SigNoz for your LLM applications, sign up for a 30-day free trial (no credit card required).