{"slug": "instrumenting-ai-agents-for-the-agent-timeline-a-practical-opentelemetry-guide", "title": "Instrumenting AI Agents for the Agent Timeline: A Practical OpenTelemetry Guide", "summary": "Honeycomb released a technical guide on instrumenting AI agents with OpenTelemetry's GenAI semantic conventions to enable debugging via the Agent Timeline, which captures tool calls, multi-agent handoffs, and downstream service spans in a single conversation view. The guide emphasizes that LLM errors are rarely the root cause of agent failures and provides manual instrumentation examples using Python and OpenTelemetry SDK.", "body_md": "# Instrumenting AI Agents for the Agent Timeline: A Practical OpenTelemetry Guide\n\nThe LLM is rarely the root cause of agent failures. This technical guide shows how to instrument AI agents using OpenTelemetry's GenAI semantic conventions so they appear in Honeycomb's Agent Timeline—including tool calls, multi-agent handoffs, and framework-specific SDKs—so you can debug what actually went wrong.\n\nBy: [Dan Juengst](/author/dan-juengst)\n\n#### Agent Timeline: The Flight Recorder for Your AI Agents\n\nEvery LLM call, every tool invocation, every agent handoff, every downstream service span, in one conversation, in one view. Now in Early Access.\n\n[Read Now](/blog/agent-timeline-flight-recorder-for-your-ai-agents)\n\nAI agents are nondeterministic, multi-step, and opaque. When one fails in production, \"the model said something weird\" is the cheapest, most useless line in your incident postmortem. To debug agents the way they actually run, you need telemetry that captures all of it, in order, with enough context to reconstruct what happened.\n\nThe OpenTelemetry GenAI Semantic Conventions give you a vendor-neutral way to do exactly that. 
Instrument your agent with the right attributes, and [Honeycomb's Agent Timeline](https://docs.honeycomb.io/investigate/observe/agent-timeline) renders the whole conversation: model calls, tool calls, agent handoffs, failures, and the downstream API and database work all of that triggered, bound together by a shared conversation ID.\n\nHere's how to [instrument your agents](https://docs.honeycomb.io/send-data/agents) so they show up in the Timeline and become debuggable once they're there.\n\n## The three attributes you need to start\n\nThe Agent Timeline [groups multiple traces and multiple agents](https://docs.honeycomb.io/send-data/use-cases/agents) into a single conversation view. To make that work, every span in your agent's execution chain needs three attributes:\n\n- `gen_ai.conversation.id`: the shared ID that binds every span in a conversation together\n- `gen_ai.agent.name`: which agent did the work\n- `gen_ai.operation.name`: the kind of operation, such as `invoke_agent`, `chat`, or `execute_tool`\n\nOne subtlety that matters: **a \"GenAI span\" is not just an LLM call**. It's any span anywhere in the execution chain triggered by an agent, including downstream database queries, third-party API calls, or background jobs that ran because the agent decided to call a tool. If a span exists because the agent did something, it should carry the conversation ID.\n\nThat's what makes the Timeline work end to end. Without the conversation ID propagating into your downstream system spans, you get the LLM-only view that dedicated AI observability tools stop at, except in production. But the root cause is rarely the LLM.\n\n## A minimum viable example\n\nHere's manual instrumentation in Python using the [OpenTelemetry](/platform/opentelemetry) SDK. 
This assumes you've already configured an OTLP exporter pointing at Honeycomb.\n\n``` python\nimport json\nimport uuid\nfrom opentelemetry import trace\nfrom opentelemetry.trace import Status, StatusCode\n\ntracer = trace.get_tracer(\"my-agent\")\n\ndef run_agent(user_message: str):\n    conversation_id = str(uuid.uuid4())\n\n    with tracer.start_as_current_span(\"invoke_agent support_agent\") as span:\n        span.set_attribute(\"gen_ai.conversation.id\", conversation_id)\n        span.set_attribute(\"gen_ai.agent.name\", \"support_agent\")\n        span.set_attribute(\"gen_ai.operation.name\", \"invoke_agent\")\n\n        return call_llm(user_message, conversation_id)\n\ndef call_llm(message: str, conversation_id: str):\n    with tracer.start_as_current_span(\"chat gpt-4o\") as span:\n        span.set_attribute(\"gen_ai.conversation.id\", conversation_id)\n        span.set_attribute(\"gen_ai.agent.name\", \"support_agent\")\n        span.set_attribute(\"gen_ai.operation.name\", \"chat\")\n        span.set_attribute(\"gen_ai.request.model\", \"gpt-4o\")\n\n        # ... actual LLM call ...\n        result = None  # placeholder: assign your provider's response here\n\n        # Example values; in practice, read these from the provider response\n        span.set_attribute(\"gen_ai.response.model\", \"gpt-4o-2024-08-06\")\n        span.set_attribute(\"gen_ai.usage.input_tokens\", 142)\n        span.set_attribute(\"gen_ai.usage.output_tokens\", 87)\n        return result\n```\n\nThread `conversation_id` through your call stack so every span, including downstream HTTP clients, database queries, and queue workers, can attach it. 
That's how you get the full-stack picture rather than an LLM-only one.\n\n## Lighting up the rest of the Timeline UI\n\nThe attributes below are what make the Timeline genuinely actionable: token usage, model identification, tool call debugging, and failure detection.\n\n### Token usage\n\nSet these on every chat or completion span:\n\n- `gen_ai.usage.input_tokens`\n- `gen_ai.usage.output_tokens`\n- `gen_ai.usage.cache_read.input_tokens`\n- `gen_ai.usage.cache_creation.input_tokens`\n\nOnce they're queryable as high-cardinality attributes, you can correlate token spend with model, latency, conversation outcome, and user sentiment in a single query.\n\n### Model identification\n\n- `gen_ai.request.model`: what you asked for\n- `gen_ai.response.model`: what you got\n\nThese often differ. You request `gpt-4o` and get a specific dated version like `gpt-4o-2024-08-06`. Capturing both is how you debug behavior changes after a silent provider-side model upgrade.\n\n### Tool calls\n\nTool calls are where most agentic failures live. 
Instrument every tool execution span like this:\n\n``` python\nwith tracer.start_as_current_span(f\"execute_tool {tool_name}\") as span:\n    span.set_attribute(\"gen_ai.conversation.id\", conversation_id)\n    span.set_attribute(\"gen_ai.agent.name\", \"support_agent\")\n    span.set_attribute(\"gen_ai.operation.name\", \"execute_tool\")\n    span.set_attribute(\"gen_ai.tool.name\", tool_name)\n    span.set_attribute(\"gen_ai.tool.call.id\", tool_call_id)\n    span.set_attribute(\"gen_ai.tool.call.arguments\", json.dumps(args))\n\n    try:\n        result = execute(tool_name, args)\n        span.set_attribute(\"gen_ai.tool.call.result\", json.dumps(result))\n        span.set_attribute(\"gen_ai.response.finish_reasons\", json.dumps([\"stop\"]))\n        return result\n    except Exception as e:\n        span.set_attribute(\"error.type\", type(e).__name__)\n        span.set_status(Status(StatusCode.ERROR, str(e)))\n        raise\n```\n\nIf a tool call fails, set `error.type` and propagate the error status to the parent span. The Timeline's \"Show Failures Only\" mode and the conversation-level failure count both rely on this signal. This is what turns failures into first-class navigation primitives instead of needles in a haystack. In addition, if the tool call can accept a propagated `gen_ai.conversation.id` and send OpenTelemetry spans, you can track exactly what happens within that tool call.\n\n### Prompts and responses (with PII caveat)\n\n- `gen_ai.input.messages`: full prompts\n- `gen_ai.output.messages`: full responses\n\nThese make root-cause investigation dramatically faster because you can read what the agent was told and what it said. They also capture PII and sensitive data by default. 
Treat them like any other sensitive payload: redact at the application layer, scrub at the OpenTelemetry Collector with a processor, or restrict capture to non-production environments based on your data-classification rules.\n\n### Embeddings\n\nOn embedding spans, set `gen_ai.request.model` and `gen_ai.usage.input_tokens`.\n\n### Evaluation results\n\nAttach `gen_ai.evaluation.result` events to GenAI operation spans for hallucination, bias, relevance, or any custom eval signal. This is what closes the loop between cost, latency, and quality. They are all queryable together as span data.\n\n## Multi-agent instrumentation\n\nFor multi-agent systems, two rules:\n\n- **Each agent gets its own `gen_ai.agent.name`.** This drives swim lanes and handoff visibility in the Timeline. Sub-agents use their own distinct names; they don't inherit from the parent. If the attribute is missing, the span shows up as \"Unknown,\" which defeats the point.\n- **The calling agent emits the `invoke_agent` span, not the agent being called.** The called agent then emits its own `chat`, `execute_tool`, and other spans under its own `gen_ai.agent.name`. 
This makes the handoff itself an explicit, queryable event in the trace.\n\n``` python\n# Orchestrator agent invoking a specialist agent\nwith tracer.start_as_current_span(\"invoke_agent billing_agent\") as span:\n    span.set_attribute(\"gen_ai.conversation.id\", conversation_id)\n    span.set_attribute(\"gen_ai.agent.name\", \"orchestrator\")   # the caller\n    span.set_attribute(\"gen_ai.operation.name\", \"invoke_agent\")\n\n    # billing_agent emits its own spans under\n    # gen_ai.agent.name = \"billing_agent\"\n    return billing_agent.handle(query, conversation_id)\n```\n\n## Span naming conventions\n\nConsistent span names matter because they are what the Timeline uses to group and render operations correctly. The examples in this guide name each span as the operation followed by its target:\n\n- `invoke_agent support_agent`\n- `chat gpt-4o`\n- `execute_tool {tool_name}`\n\n## Doing this with the SDKs you use\n\nManual instrumentation works for any agent, but in practice most teams build on a framework or vendor SDK. Here's how to think about each.\n\n**OpenAI Python SDK.** The `openai` package doesn't emit GenAI semconv spans natively, but OpenTelemetry contrib [auto-instrumentation](/blog/what-is-auto-instrumentation) packages exist that emit `chat` and `embeddings` spans with the right attributes out of the box. Drop the instrumentation in and you get LLM-layer telemetry for free. You still need to set `gen_ai.conversation.id` and `gen_ai.agent.name` yourself. Wrap LLM calls in a parent span you control, and the child spans the auto-instrumentation emits will inherit the conversation context.\n\n**Anthropic Python SDK.** Auto-instrumentation exists in the OpenTelemetry contrib ecosystem; combine it with your own conversation-scoping span so the conversation ID is in scope when the SDK call fires.\n\n**LangChain and LangGraph.** LangChain's callback system can be wired to OpenTelemetry through community packages. 
Auto-instrumentation gets you the LLM and tool spans, but you're still responsible for the conversation ID, the agent name (especially in multi-agent graphs where every node is a distinct agent), and propagating trace context into any custom tools or downstream services LangChain doesn't see.\n\nThe pattern across all of them is the same: **let the framework instrumentation own the LLM-layer spans, and you own the agent-layer and conversation-layer attributes.** Auto-instrumentation can't infer your conversation boundaries or your agent identity. That's a property of your application.\n\n## What you'll see when it works\n\nWith these attributes flowing into Honeycomb, an Agent Timeline view of one conversation gives you:\n\n- A conversation summary: total duration, model calls, tool calls, retries, agents involved, failure count\n- Horizontal swim lanes per agent, with explicit handoffs between them\n- Inline highlights on failing spans and a \"Show Failures Only\" filter\n- A Gen AI panel showing prompts, responses, token usage, model name, and finish reasons\n- The full trace waterfall including downstream API and database spans, pivotable into Canvas for system-wide investigation across many conversations\n\nThat's the difference between knowing an agent failed and understanding why. The model is rarely the root cause. 
Instrument the whole chain, including LLM calls, tool calls, handoffs, and the downstream system work, and your telemetry will point you in the right direction the next time something breaks.", "url": "https://wpnews.pro/news/instrumenting-ai-agents-for-the-agent-timeline-a-practical-opentelemetry-guide", "canonical_source": "https://www.honeycomb.io/blog/instrumenting-ai-agents-agent-timeline-opentelemetry-guide", "published_at": "2026-06-29 13:00:00+00:00", "updated_at": "2026-06-30 12:23:35.574358+00:00", "lang": "en", "topics": ["ai-agents", "ai-tools", "ai-infrastructure", "developer-tools"], "entities": ["Honeycomb", "OpenTelemetry", "Dan Juengst", "Agent Timeline", "GenAI semantic conventions", "Python", "OTLP"], "alternates": {"html": "https://wpnews.pro/news/instrumenting-ai-agents-for-the-agent-timeline-a-practical-opentelemetry-guide", "markdown": "https://wpnews.pro/news/instrumenting-ai-agents-for-the-agent-timeline-a-practical-opentelemetry-guide.md", "text": "https://wpnews.pro/news/instrumenting-ai-agents-for-the-agent-timeline-a-practical-opentelemetry-guide.txt", "jsonld": "https://wpnews.pro/news/instrumenting-ai-agents-for-the-agent-timeline-a-practical-opentelemetry-guide.jsonld"}}