{"slug": "is-claude-code-getting-worse-how-to-measure-degradation-with-opentelemetry", "title": "Is Claude Code Getting Worse? How to Measure Degradation with OpenTelemetry", "summary": "Claude Code users report declining output quality and efficiency, with complaints of less accurate edits and increased back-and-forth sessions appearing across developer communities. Anthropic's AI coding tool exports telemetry through OpenTelemetry, allowing teams to track output-per-token ratios—measuring lines of code, commits, and pull requests against token consumption—to detect degradation patterns like context bloat, cache misses, and rejected edits before they impact team velocity.", "body_md": "# Is Claude Code Getting Worse? How to Measure Degradation with OpenTelemetry\n\nThe Productivity Problem Nobody Is Measuring\n\nSomething has been quietly frustrating developers who use Claude Code regularly. The complaints are showing up across forums, channels, and developer communities: Claude feels worse than it used to. The edits aren't as accurate. It takes more back-and-forth to get the same result. The output quality seems to have dipped. Most of these complaints are anecdotal, and that's exactly the problem.\n\nThe deeper issue is that most teams aren't measuring the right thing. They might track token usage or API costs, but tokens are an input. What actually matters is what you're getting for those tokens:\n\n- Lines of code written\n- Commits created\n- Pull requests merged.\n\nIf your token consumption stays flat but your output per token quietly declines week over week, Claude Code is becoming less productive for your team. Without the right telemetry, you'd never know until the drop in developer velocity makes itself felt.\n\nEfficiency degradation is invisible until it's significant. 
By the time you notice it in your team's output, it's already been happening for weeks.\n\nBy the end of this post, you'll know:\n\n- Which signals to track to catch efficiency degradation before it shows up in team velocity\n- How to turn them into actionable dashboards\n- What specific patterns to look out for\n\nWhy Output-Per-Token Is the Metric That Actually Matters\n\nTokens are an input, not an outcome. Spending 100,000 tokens on a session tells you nothing about whether that session was productive. What tells you that is the actual output that moves your codebase forward: how many lines of code were added, how many commits were created, and how many pull requests were opened.\n\nThe ratio between those outputs and the tokens consumed is your efficiency signal. A healthy Claude Code deployment shows stable or improving output-per-token ratios over time. Degradation shows up as those ratios declining, with more tokens being consumed for the same or less output.\n\nWhat causes efficiency to degrade? 
There are four patterns that consistently appear:\n\n| Driver | What happens | How it shows up |\n|---|---|---|\n| Context bloat | Sessions grow heavier over time; the full conversation history is sent with every request, so input tokens compound as a session runs longer | More tokens consumed per session, no corresponding output increase |\n| Cache misses | Repeated context is re-processed at full input cost instead of being served cheaply from cache | Falling cache hit rate drags every output-per-token ratio down |\n| Subagent multiplication | Agentic workflows spawn independent background API calls via the Task tool, multiplying token consumption several times over | Subagent token share grows while output ratios decline |\n| Rejected edits | Tokens spent generating an edit that gets thrown away contribute nothing to output | Rising rejection rate; token spend increases without corresponding lines-of-code output |\n\nNone of these patterns are visible in aggregate cost or token usage data alone. You need output per token ratios to see them and that's what the panels in this post are designed to give you.\n\nHow Claude Code Exposes Telemetry\n\nClaude Code exports observability data through OpenTelemetry, the open standard for collecting and exporting telemetry. It supports three signal types: metrics, logs, and traces. For efficiency monitoring you'll work primarily with metrics and logs.\n\nGetting telemetry flowing requires just a handful of environment variables:\n\n```\n# Enable telemetry\nexport CLAUDE_CODE_ENABLE_TELEMETRY=1\n\n# Configure exporters\nexport OTEL_METRICS_EXPORTER=otlp\nexport OTEL_LOGS_EXPORTER=otlp\n\n# Point to your collector\nexport OTEL_EXPORTER_OTLP_PROTOCOL=grpc\nexport OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317\n```\n\nThis works with any OTLP-compatible backend. 
Check out the [Claude Code monitoring guide](https://signoz.io/docs/claude-code-monitoring/) for detailed setup instructions.\n\nThe metrics that form the foundation of efficiency monitoring are:\n\n| Metric | What it tracks |\n|---|---|\n| `claude_code.token.usage` | Tokens consumed per API request, broken down by `type` |\n| `claude_code.lines_of_code.count` | Lines of code added or removed |\n| `claude_code.commit.count` | Git commits created via Claude Code |\n| `claude_code.pull_request.count` | Pull requests created via Claude Code |\n| `claude_code.code_edit_tool.decision` | Accept/reject decisions on code edits |\n\nThe key attributes that make segmentation possible:\n\n| Attribute | What it enables |\n|---|---|\n| `type` | Break token usage into `input`, `output`, `cacheRead`, `cacheCreation` |\n| `query_source` | Separate `main` from agentic (`subagent`) token spend |\n| `decision` | Split edit decisions into `accept` and `reject` |\n| `session.id` | Per-session efficiency rollups |\n| `user.email` | Per-user efficiency tracking |\n\nFor multi-team orgs, `OTEL_RESOURCE_ATTRIBUTES` lets you attach custom dimensions like `department`, `team.id`, or `cost_center` to every metric and log, enabling team-level efficiency tracking without any changes to how developers use the tool.\n\nThe Panels That Actually Matter and What They Tell You\n\nRaw telemetry data is only useful if it's shaped into something you can actually read and act on. The panels below are organized around three questions: are we degrading, where is efficiency being lost, and what specifically is causing it.\n\nAre we degrading?\n\nThese are your headline metrics. They give you a direct, quantitative answer to whether Claude Code is becoming less productive over time. Track them as time series and watch the trend. 
The absolute values matter less than the direction.\n\nLines of Code Added Per 1M Tokens\n\n**Metric:** `claude_code.lines_of_code.count` (`type = added`) ÷ `claude_code.token.usage` × 1,000,000\n\nThis is the most direct efficiency signal available. It tells you how many lines of code Claude Code is producing per million tokens consumed. A stable or rising line means efficiency is holding or improving. A declining trend is your clearest signal that something is degrading.\n\nBe careful not to over-index on short-term dips. A single day of low output could just mean developers were working on complex refactors that produce fewer net lines. What you're watching for is a consistent downward trend across multiple days or weeks.\n\nCommits Per 1M Tokens\n\n**Metric:** `claude_code.commit.count` ÷ `claude_code.token.usage` × 1,000,000\n\nWhere lines of code per 1M tokens measures raw output, commits per 1M tokens measures completed units of work. A commit represents a coherent, accepted change, so this ratio captures not just how much code Claude Code is producing, but how much of it is making it through to completion.\n\nWatch for divergence between this panel and lines of code per 1M tokens. If lines per 1M tokens is stable but commits per 1M tokens is falling, Claude Code may be producing code that isn't making it to commit. That divergence is worth cross-referencing with the edit rejection rate panel.\n\nPRs Per 1M Tokens\n\n**Metric:** `claude_code.pull_request.count` ÷ `claude_code.token.usage` × 1,000,000\n\nPull requests represent the highest-level unit of completed output: code that's been written, committed, and submitted for review. PRs per 1M tokens is your broadest efficiency signal and one indicator of Claude Code's contribution to your development workflow.\n\nThis ratio tends to move more slowly than the others because PRs accumulate over longer time windows. Use it as a weekly or monthly trend rather than a daily one. 
A quarter-over-quarter decline in PRs per 1M tokens is a strong signal that Claude Code's contribution to your development workflow is eroding.\n\nWhere is efficiency being lost?\n\nOnce your headline metrics show a declining trend, these panels tell you which of the four efficiency drivers is responsible.\n\nCache Hit Rate\n\n**Metric:** `claude_code.token.usage` (`type = cacheRead`) ÷ (`claude_code.token.usage` (`type = input`) + `claude_code.token.usage` (`type = cacheRead`))\n\nCache hit rate is the first thing to check when efficiency ratios start declining. When prompt caching is working well, repeated context is served from cache at a fraction of the standard input token cost, meaning more of your token budget goes toward generating useful output rather than re-processing the same context.\n\nA falling cache hit rate is a direct drag on every output per token ratio. Common causes:\n\n- Frequently changing system prompts\n- Sessions too short to benefit from cache warming\n- Context being restructured between requests in a way that invalidates the cache\n\nA sudden drop from a previously stable level is a strong signal that something in your workflow changed.\n\nInput Tokens Per Session Over Time\n\n**Metric:** `claude_code.token.usage` (`type = input`) ÷ count of distinct `session.id` values\n\nThis panel tracks how heavy the average session is becoming over time. As sessions grow longer, input tokens compound on every subsequent request. That extra context cost dilutes your output per token ratios without contributing anything to output.\n\nThis is one of those panels where the trend matters far more than the absolute number:\n\n- **Flat or stable line**: healthy; session weight is being managed well\n- **Trending upward**: context bloat is setting in. Possible causes:\n  - Long sessions left open instead of starting fresh\n  - Large files loaded into context repeatedly across requests\n  - Compaction not triggering often enough to trim history\n\nSubagent Token Spend vs Main\n\n**Metric:** `claude_code.token.usage` (`query_source = subagent`) stacked against `claude_code.token.usage` (`query_source = main`)\n\nWhen Claude Code delegates work to subagents via the Task tool, each subagent makes its own independent API calls, complete with their own input context, output generation, and cache behavior. A single user prompt that triggers a multi-step agentic workflow can result in many background API calls, consuming tokens that don't map directly to visible output in your lines of code or commit metrics.\n\nThis panel makes that dynamic visible. If subagent spend is growing as a share of total token consumption while your output ratios are declining, your agentic workflows are consuming an increasing portion of your token budget without producing proportional output. That's a direct efficiency leak worth investigating at the workflow level.\n\nTool Edit Rejection Rate\n\n**Metric:** `claude_code.code_edit_tool.decision` (`decision = reject`) ÷ (`claude_code.code_edit_tool.decision` (`decision = accept`) + `claude_code.code_edit_tool.decision` (`decision = reject`))\n\nEvery rejected edit represents tokens spent generating output that contributed nothing. A stable, low rejection rate is healthy. 
A rising rejection rate means an increasing share of your token spend is producing edits that get thrown away.\n\nThis panel is particularly useful for distinguishing between two types of efficiency degradation:\n\n- **Token efficiency degradation**: more tokens consumed per edit\n- **Output quality degradation**: more edits rejected regardless of token count\n\nIf your lines of code per token are falling and your rejection rate is rising at the same time, the quality of Claude Code's output is likely the primary driver, not context bloat or cache issues.\n\nYou can also break this down by the `language` attribute to see if rejection rates are higher for specific programming languages. This is useful for identifying whether degradation is general or concentrated in a particular part of your codebase.\n\nWhat specifically is causing it?\n\nOnce you've identified which efficiency driver is responsible, these panels help you pinpoint the exact session or failure mode.\n\nMost Expensive Sessions by Output Ratio\n\n**Metric:** `claude_code.token.usage` summed by `session.id` and `user.email`, cross-referenced against `claude_code.lines_of_code.count` by `session.id`\n\nThis is your investigation starting point. A session-level table that shows tokens consumed alongside lines of code produced lets you immediately spot sessions with a high token cost and low output, which is the clearest signature of efficiency degradation at the session level.\n\nAdd `user.email` as a grouping dimension so each row tells you who ran the session. What you're looking for are sessions consuming significantly more tokens than the median while producing the same or less output. These are your highest-priority investigation targets. Understanding what happened in those sessions will usually point you directly at the underlying cause.\n\nAPI Retry Cost Waste\n\n**Log (Event):** `claude_code.api_retries_exhausted` (`total_attempts`)\n\nRetries are a silent efficiency drain. 
When an API request fails and Claude Code retries it, each attempt consumes tokens while contributing zero output. If your output per token ratios are declining and retry exhaustion events are spiking at the same time, retries may be responsible for a significant portion of the drop.\n\nA flat line close to zero is normal. Spikes indicate periods where requests were consistently failing and being retried, burning tokens on each attempt before eventually either succeeding or giving up.\n\nGoing Further: Alerting on Efficiency\n\nDashboards tell you what happened. Alerts tell you when it's happening. For efficiency monitoring, the most valuable alerts are thresholds on your output per token ratios, configured to fire when a ratio drops below a baseline you've established from your historical data.\n\nThe table below is a starting point for threshold-based alerts. Adjust the thresholds once you've accumulated two to four weeks of baseline data:\n\n| Signal | Alert condition | Urgency |\n|---|---|---|\n| Lines of code per token | Drops more than 20% below your 7-day baseline | Warning |\n| Cache hit rate | Falls below 60%, or drops more than 15 points week-over-week | Warning |\n| Edit rejection rate | Rises above 30%, or increases more than 10 points week-over-week | Warning |\n| `api_retries_exhausted` events | Any spike meaningfully above your baseline | Critical |\n\nFor orgs managing Claude Code across multiple teams, `OTEL_RESOURCE_ATTRIBUTES` lets you track efficiency ratios by team or department:\n\n```\nexport OTEL_RESOURCE_ATTRIBUTES=\"department=engineering,team.id=platform\"\n```\n\nThis enables team-level efficiency dashboards, making it possible to see whether degradation is org-wide or concentrated in a specific team's workflow, which narrows the scope when something goes wrong.\n\nConclusion\n\nSo is Claude Code actually as productive as it used to be? You can answer quantitatively. Output per token ratios give you a precise, measurable signal that anecdotal complaints never can. 
A declining ratio is actionable. A feeling that things are just worse is not.\n\nThe OTel telemetry Claude Code exposes makes this straightforward to set up. The same pipeline that powers usage/cost monitoring gives you everything you need for efficiency monitoring. It's a question of which metrics you choose to derive and which panels you choose to build.\n\nThe most important thing is to start tracking before you need the data. By the time efficiency degradation is visible, you've already lost weeks of signal that would have told you what changed and when. Start the telemetry pipeline now, establish your baseline, and you'll have the data you need to answer the question the next time someone says Claude Code doesn't feel as sharp as it used to.\n\nThe full list of metrics, logs, and attributes Claude Code exports is documented in the [Claude Code monitoring documentation](https://code.claude.com/docs/en/monitoring-usage/).", "url": "https://wpnews.pro/news/is-claude-code-getting-worse-how-to-measure-degradation-with-opentelemetry", "canonical_source": "https://signoz.io/blog/claude-code-measure-degradation-opentelemetry/", "published_at": "2026-05-26 13:16:23+00:00", "updated_at": "2026-05-26 13:40:57.864897+00:00", "lang": "en", "topics": ["ai-tools", "large-language-models", "generative-ai", "ai-products", "mlops"], "entities": ["Claude Code", "OpenTelemetry"], "alternates": {"html": "https://wpnews.pro/news/is-claude-code-getting-worse-how-to-measure-degradation-with-opentelemetry", "markdown": "https://wpnews.pro/news/is-claude-code-getting-worse-how-to-measure-degradation-with-opentelemetry.md", "text": "https://wpnews.pro/news/is-claude-code-getting-worse-how-to-measure-degradation-with-opentelemetry.txt", "jsonld": "https://wpnews.pro/news/is-claude-code-getting-worse-how-to-measure-degradation-with-opentelemetry.jsonld"}}