{"slug": "i-turned-on-agent-tracing-for-30-days-4-hidden-bottlenecks-were-eating-47-of-my", "title": "I Turned on Agent Tracing for 30 Days. 4 Hidden Bottlenecks Were Eating 47% of My Tokens.", "summary": "A developer traced a production Claude agent for 30 days and discovered four hidden bottlenecks consuming 47% of the monthly token budget. The agent, which performs code review and changelog drafting, was burning 5.2 million tokens per month—far more than its workload justified. Fixing the bottlenecks, including a Stripe API integration with no backoff that alone consumed 18% of tokens, cut the token bill in half without changing the agent's behavior.", "body_md": "I have a production Claude agent that has been running for about four months. It does code review on incoming PRs, drafts changelog entries, and occasionally summarizes a Slack channel. Nothing exotic. Nothing the marketing pages would put on a banner.\n\nIt was burning **5.2 million tokens a month**. I knew that because Anthropic's invoice told me. What the invoice did not tell me was where the tokens were going. The agent's logs said \"PR-1234 reviewed in 3 turns, 14k tokens.\" That math should not add up to 5.2M unless the agent is reviewing roughly 370 PRs a month. The team ships about 80 PRs a month.\n\nSo I turned on per-call tracing for 30 days. By the end of the month I had found four bottlenecks the existing logs were structurally unable to surface. Together they were eating 47% of the monthly token bill while contributing zero new behavior. Fixing them cut the bill in half without changing what the agent does.\n\nThis post is the four bottlenecks, the trace query that found each one, and the fix.\n\nI am not going to claim a perfect setup. What I did was the smallest amount of tracing that gave me ground truth. 
Specifically:\n\n- A span for every LLM call (`anthropic.messages.create`) with attributes `{model, input_tokens, output_tokens, cached_tokens, stop_reason}`\n- A span for every tool call (`tool.call`) with `{server_name, tool_name, input_size_bytes, output_size_bytes, duration_ms, error}`\n\nI used OpenTelemetry with the GenAI semantic conventions (the 2026-03 revision, which is the one most exporters agree on right now). Storage was Pydantic Logfire because it groks the GenAI attributes out of the box and the free tier covered 30 days of one agent. Helicone and Langfuse work fine too. The brand of the dashboard does not matter; the per-call span does.\n\nThe \"per-call\" is the part that matters. Aggregated metrics (\"avg input_tokens by hour\") would have shown me the bill going up but not why. Per-call spans let me ask \"what was in the input on the 14 most expensive calls last week\" and answer with one query.\n\nThe first query I ran was \"show me agent turns where the same tool was called more than twice in a row.\" This is the kind of thing you would never look at in aggregate.\n\nIt came back with one offender: a Stripe API integration that was returning 429 rate-limit errors during peak hours. The Stripe MCP server had no backoff. The agent's behavior on tool error was to retry up to 7 times within a single turn. Each retry re-sent the full prompt context, because LLM calls do not have built-in idempotency. Seven retries on a 14k-token prompt is roughly 100k tokens to discover that Stripe is busy.\n\nThis was happening on roughly 30% of PRs that touched payment code. None of the agent's user-facing logs mentioned the retries because the agent successfully completed the turn after Stripe came back. From the outside, everything was fine. From the trace, it was the single biggest line item in the bill: about **18% of monthly tokens**.\n\nFix: backoff with jitter in the Stripe MCP server, and a hard cap of 2 retries per tool per turn at the agent level. Six lines of code. 
The fix shipped on day 9 of the 30-day tracing window and the next invoice cycle reflected it.\n\nThe agent's design has it re-reading `CLAUDE.md` at the start of every turn. This was deliberate when I built it; I wanted the agent to pick up changes to the rules without a restart. It is also approximately 4,000 tokens of context for the file alone, plus the agent generally needs 2-3 supporting files per turn (the codebase's `README.md`, an `OWNERS.md`, and a `style.md`).\n\nPer-call tracing showed me that the average PR review involved **14 turns**, and each turn re-fetched all four context files. Total context tokens per PR review: around 56k tokens, of which maybe 8k were genuinely needed (the diff and one or two relevant source files).\n\nFix: introduce a turn-level cache for read-only files using Anthropic's prompt-caching API (which existed but I had never bothered to wire in). The agent now reads `CLAUDE.md` etc. once per session with `cache_control: ephemeral`, and subsequent turns hit the cache at 10% the cost. The aggregate effect was a **14% drop in monthly tokens**, with no behavior change.\n\nThe trace query that found this: \"group by `tool_name == read_file` AND `input.path == 'CLAUDE.md'`, then count per agent session.\" If I had been looking at the dashboard's \"average tokens per turn\" chart I would never have seen it because the average was hiding the multiplier.\n\nWhen the agent spawns three sub-agents for code review (architect / security / performance), it was passing the full PR diff to each one independently. The diff was, on average, 3,000 tokens. Three sub-agents getting 3,000 tokens of diff = 9,000 tokens of duplicate context per PR. Over 80 PRs a month, that is 720k tokens spent re-sending the same diff.\n\nFix: pass the diff once to a parent context, have the three sub-agents reference the same cached block. 
With prompt caching, the second and third sub-agents pay 10% of the input cost on the shared context. The net effect: roughly a **9% monthly drop**.\n\nThe trace query: \"group spans by `parent_trace_id` and look at input similarity across child spans.\" Most observability tools cannot answer this out of the box; I exported the spans to a Jupyter notebook and ran a quick diff. The duplication was 92% byte-for-byte.\n\nThis is the cheap one and the funniest one. Across about 40% of the agent's responses, the model was opening with some variation of \"You're absolutely right\" or \"Great question\" followed by a short paraphrase of the prompt before answering. That is roughly 120 tokens of output per turn that contribute zero information.\n\nPer turn, 120 tokens is nothing. Over a month of an agent making roughly 1,100 turns, with output tokens priced higher than input, it added up to about **6% of monthly tokens**.\n\nFix: a `system` instruction that says \"Do not restate the user's request or open with agreement. Begin with the answer.\" Output tokens per turn dropped by an average of 95 within a day.\n\nThe trace query: \"show me the first 200 characters of `output_text` across all turns this week.\" Half of them started the same way. This is the kind of thing you only see when you can look at the actual content, not the metrics.\n\nI keep coming back to this because it took me a while to internalize.\n\nThe aggregated metrics I had before (average tokens per turn, total cost per day, p95 latency) showed me that the bill was going up. They could not show me which behavior of the agent was responsible. The bottlenecks above were all \"this turn cost a normal amount; there are just a lot of these turns\" patterns. Aggregates hide them by design.\n\nPer-call tracing is annoying. It produces a lot of data. The Logfire UI for one month of one agent had about 1.4 million spans. You cannot read them. You can query them, which is the entire point. 
Every one of the four bottlenecks was a one-line trace query that I could not have asked of any aggregate dashboard.\n\nThree things I left in place after the 30 days:\n\n**Per-call spans, always on.** The instrumentation cost is roughly 2% in latency overhead and negligible in storage cost on the free tiers. I do not turn it off when I am \"done debugging\" because the next bottleneck will look exactly like these four did: silently expensive, invisible to aggregates.\n\n**A weekly trace audit.** Every Monday I run six saved queries (the four above plus two on tail latency and error patterns). It takes 10 minutes. It catches one new issue roughly every 6-8 weeks.\n\n**A budget alert at 80% of last month.** If the agent's token consumption is on pace to beat last month by 20% with no design change, something is wrong. The alert fires before the invoice does.\n\nI do not trust the agent's user-facing logs to tell me what it is doing. The agent's own logs are a summary. The summary is written by the same model whose behavior I am trying to measure. There is no version of that loop that is not going to flatter itself.\n\nI also do not use only token-cost metrics. The four bottlenecks above are all \"I am paying for behavior I do not want.\" The next four will probably be \"I am paying for behavior I do want but pricing has changed\" or \"I am paying for tail latency in a tool I depend on.\" Those need different queries. Per-call spans are the substrate that lets me write the next query whenever I think of it.\n\nThe lesson, if there is one, is the same as it was 30 years ago for backend services: you cannot manage what you cannot see, and aggregates are not seeing, they are summarizing. AI agents are a system. 
Treat them like one.\n\nThe longer write-up on the OpenTelemetry GenAI conventions, the per-platform tracing setup (Logfire / Helicone / Langfuse), and the W3C trace-context plumbing that connects sub-agents to their parents is in [Observability Across Frontend and Backend](https://kenimoto.dev/books/harness-engineering-guide?utm_source=devto&utm_medium=article&utm_campaign=agent-tracing-30d). The harness chapter is where the budget-alert loop lives.", "url": "https://wpnews.pro/news/i-turned-on-agent-tracing-for-30-days-4-hidden-bottlenecks-were-eating-47-of-my", "canonical_source": "https://dev.to/kenimo49/i-turned-on-agent-tracing-for-30-days-4-hidden-bottlenecks-were-eating-47-of-my-tokens-1pa6", "published_at": "2026-05-27 22:00:01+00:00", "updated_at": "2026-05-27 22:10:24.671173+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "mlops", "ai-tools", "ai-infrastructure"], "entities": ["Anthropic", "Claude", "OpenTelemetry"], "alternates": {"html": "https://wpnews.pro/news/i-turned-on-agent-tracing-for-30-days-4-hidden-bottlenecks-were-eating-47-of-my", "markdown": "https://wpnews.pro/news/i-turned-on-agent-tracing-for-30-days-4-hidden-bottlenecks-were-eating-47-of-my.md", "text": "https://wpnews.pro/news/i-turned-on-agent-tracing-for-30-days-4-hidden-bottlenecks-were-eating-47-of-my.txt", "jsonld": "https://wpnews.pro/news/i-turned-on-agent-tracing-for-30-days-4-hidden-bottlenecks-were-eating-47-of-my.jsonld"}}