You Can’t Monitor an AI Agent Like a Web Service. Here’s What I Track Instead.

A production incident involving an AI agent went undetected because traditional web service monitoring failed to catch a prompt cache invalidation that doubled costs and degraded answer quality while returning 200 OK status codes. The author argues that agents require specialized monitoring focused on token-level latency, cost per token, and output quality, not just request counts and error rates.

The worst production incident I’ve had with an agent never triggered an alert. No 500s. No latency spike. No error rate climbing on a dashboard. The API gateway was happily emitting 200s the entire time.

What actually happened was that a prompt tweak (a tiny reordering of the tool list, nothing dramatic) had quietly invalidated the prompt cache, and the agent had started answering around its retrieved context instead of from it. Cost per task roughly doubled. Answer quality dropped. The infrastructure reported perfect health throughout.

I only found out because a user told me. That’s the worst way to learn your agent is broken.

This is the thing nobody tells you when you ship your first agent: the monitoring you inherit for free (the uptime, the error rate, the p95 latency that comes bundled with your gateway) was designed for web services. And an agent is not a web service. The failures that kill agents in production are precisely the ones that return 200 OK.

The way I’ve come to structure this is around five questions. Is it fast? Can it scale? Is it correct? Does it hold up? And, once there’s an agent in the loop, how does it behave? Each one maps to a handful of metrics, and the useful insight is that some of those metrics arrive for free while the most important ones don’t exist until you build them.

What follows is what I actually instrument across the agents I run in production (Wasaphi, Izimail, client systems), connected to the architecture I keep writing about here.

If you’ve read Deterministic Shells, Probabilistic Cores (https://medium.com/towards-artificial-intelligence/deterministic-shells-probabilistic-cores-the-architecture-pattern-behind-every-reliable-agent-a5de28e36bd0), you already know my central claim: an LLM is a probabilistic core, and the engineering lives in the deterministic shell around it. Monitoring is the same story. You can’t observe a probabilistic core with tools built for deterministic web traffic, for three concrete reasons.

Latency isn’t one number anymore. A web request has a duration. An LLM response is generated token by token, so “latency” is at least three different numbers depending on where you stand on the timeline. A single global p95 pools a 50-token price check with a 2,000-token report. It describes none of your actual workloads.

Cost scales with tokens, not requests. This is the whole reason I keep saying your agent isn’t intelligent: it’s a while loop with a credit card (https://medium.com/towards-artificial-intelligence/your-agent-isnt-intelligent-it-s-a-while-loop-with-a-credit-card-47757f0cef4f). Every iteration of that loop is an LLM call, and every call burns tokens. Request count tells you almost nothing about spend. A request-per-second graph looks identical whether each request costs you a tenth of a cent or fifty cents.

The damaging failures are silent. A quality regression doesn’t throw an exception. It returns confident, well-formatted text with a 200 status code. Nothing in your serving stack knows the difference between a correct answer and a plausible lie.

That last point is the one that should keep you up at night. The monitoring you get for free covers exactly the dimensions where agents don’t tend to fail, and stays silent on exactly the dimensions where they do.

So let me walk through what I actually instrument, grouped by the question each metric answers.

Every latency metric worth tracking is a position on a timeline. The request has two phases: prefill, where the model ingests your entire prompt and builds its internal state while the user stares at nothing, and decode, where it generates output one token at a time.

Time to first token (TTFT) is queueing plus prefill: the exact duration of that blank screen. In a streaming UI, this is the number users actually feel. And here’s the part that bites RAG systems specifically: TTFT grows with prompt length. Every chunk you stuff into the context window to “help” the model is paid back as perceived slowness before a single output token appears. I wrote a whole article on why context is something you engineer, not something you dump (https://medium.com/towards-artificial-intelligence/context-engineering-what-i-changed-in-my-agents-and-why-prompt-engineering-isnt-enough-c505511b5833). TTFT is the metric that puts a price tag on getting that wrong.

Inter-token latency is the gap between tokens once streaming starts. It’s what makes output read as flowing text versus a stuttering freeze. Users tolerate a slow-but-steady stream far better than a fast one that stalls mid-sentence, which is exactly why I stream everything in Wasaphi, including the “thinking” and tool-call indicators. Perceived latency is a real metric, and it’s the one you’re optimizing.

End-to-end latency, per use case. Output length dominates the full span, so track it per workload. A “what’s the price of AAPL?” check and a full multi-stock analysis should never share a percentile.

And then agents add their own twist: latency compounds across the loop. A task that chains five sequential LLM calls pays roughly five times the per-call latency. A perfectly tolerable per-call p95 becomes an intolerable task-level wait. The moment you go multi-agent and start handing tasks over A2A, this gets worse: I noted in One Agent, Many Agents, or Something In Between (https://medium.com/towards-artificial-intelligence/one-agent-many-agents-or-something-in-between-a-decision-framework-for-agent-architecture-8acfaa4a1617) that each hop adds 300–800 ms minimum just for the network round-trip and the fresh context load on the other side. So set your latency budget at the task level and let it constrain the steps, not the other way around.

Not cost per request. Cost per successful task. This is the single most important number in the whole framework, and it’s the one almost nobody logs.
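
To make the difference concrete, here is a minimal sketch of how the two numbers could be computed from logged task records. The `TaskRecord` fields and the per-million-token prices are hypothetical placeholders, not values from any provider; the only real assumption is that each logged task carries its token counts and a success flag, and that retried or failed attempts are logged too so their spend is counted.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One logged agent task. Field names are illustrative, not a real schema."""
    use_case: str
    input_tokens: int
    output_tokens: int
    succeeded: bool


# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00


def task_cost(rec: TaskRecord) -> float:
    """Dollar cost of one task, pricing input and output tokens separately."""
    return (
        rec.input_tokens * PRICE_PER_M_INPUT
        + rec.output_tokens * PRICE_PER_M_OUTPUT
    ) / 1_000_000


def cost_per_request(records: list[TaskRecord]) -> float:
    """Total spend divided by every attempt, successful or not."""
    return sum(task_cost(r) for r in records) / len(records)


def cost_per_successful_task(records: list[TaskRecord]) -> float:
    """Total spend (failures included) divided by successes only."""
    total = sum(task_cost(r) for r in records)
    successes = sum(1 for r in records if r.succeeded)
    if successes == 0:
        return float("inf")  # money was spent and nothing succeeded
    return total / successes


if __name__ == "__main__":
    records = [
        TaskRecord("price_check", 10_000, 1_000, True),
        TaskRecord("price_check", 10_000, 1_000, False),
        TaskRecord("price_check", 10_000, 1_000, False),
        TaskRecord("price_check", 10_000, 1_000, True),
    ]
    print(f"cost per request:         ${cost_per_request(records):.3f}")
    print(f"cost per successful task: ${cost_per_successful_task(records):.3f}")
```

With half the attempts failing, the second number comes out at twice the first, which is the gap a per-request dashboard never shows. In practice you would group records by `use_case` before dividing, so that cheap and expensive workloads don’t share one figure.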
Cost per request makes a system that fails cheaply look fantastic. A request that costs half as much but succeeds a third as often is more expensive where it counts: half the cost divided by a third of the success rate is 1.5 times the cost per success. And the failed attempts trigger retries that multiply your spend invisibly.

I learned this the expensive way. Early Wasaphi pulled 50 Reddit posts with 100 comments each into a single analysis. That’s roughly a million tokens per request. My API bill after one day of testing was $47, and most of those tokens were producing garbage the agent couldn’t even finish reasoning about before it hit its iteration limit. Cost per request looked alarming. Cost per successful task was effectively infinite, because nothing succeeded.

A few things to actually log here:

Input and output tokens, per request, broken down by use case. This is the raw material of your unit economics, and output tokens are usually priced at a multiple of input. There’s nothing clever about it; you just have to log both, every time, with a use-case tag.

Cache hit rate. Prompt caching is one of the largest single cost levers you have if your system has a stable prefix: a fixed system prompt, tool definitions, shared document context. But caching matches the prefix byte for byte, so a timestamp interpolated near the front, or a reordered tool list, silently invalidates it. This is exactly the failure I opened this article with. A falling cache hit rate is often the first sign that someone changed how prompts get assembled. If you build your system prompt dynamically (and you should), this metric is your canary.

Cost per successful task is also the connective number. You can push latency, cost, or quality in isolation and feel productive. This is the one metric that tells you whether the system as a whole actually got better or just moved the problem somewhere you weren’t looking.

Here’s the uncomfortable part. Every metric in the first two groups is, more or less, a field on a response object or a timestamp you already have.
Correctness emits nothing. The serving stack has no idea whether the answer was right. If you want to know, you build the measurement yourself, every piece of it.

A labeled eval set with a task success rate. This is the regression suite of an AI system, and it’s the highest-leverage thing on this entire list. It does not need to be big. Fifty to a few hundred examples with known-good outcomes, re-run on every prompt change, every model swap, every retrieval tweak. The discipline matters more than the size: the set has to be representative and it has to be maintained. Without it, every change you ship is a guess.

Groundedness, for RAG. Is the answer supported by what you retrieved, or invented in the gap between retrieval and generation? You can have excellent retrieval and still hallucinate in the synthesis step. This matters less with frontier models, but the entire economic point of a good RAG pipeline is to let you run the smallest model you can get away with, and smaller models drift from their context more. So you measure it.

Retrieval precision and recall. When answer quality drops, check retrieval before you start rewriting prompts. Generation cannot fix what retrieval never surfaced. Retrieval is still the harder problem in most systems I see, and prompt-tinkering is the seductive wrong place to spend your afternoon.

LLM-as-judge, calibrated against human labels. A judge makes quality measurable at volume, which is the only way to measure it at all once you’re past a handful of examples. But judges drift: upgrade the judge model and your scores move with no change to the system under test. Calibrate against a human-labeled sample at the start, and re-calibrate whenever the judge changes or the trend suddenly looks too good to be true.

User behavior signals. Explicit thumbs up/down are rare and biased toward the extremes: people rate when they’re delighted or furious, not when something is quietly mediocre. The honest signals are behavioral: regeneration rate, and how heavily users edit your output before they accept it. Those get recorded at the moment the user acts on the answer, not when they feel like filling out a survey.

Error, timeout, and rate-limit rates, split per provider. Aggregate availability hides which dependency is degrading. Rate limits in particular arrive in bursts tied to one provider’s quota, and you want to know which one before it cascades.

Retry and fallback rate, and the trap inside it. Falling back to a backup model keeps your availability dashboard green while quality quietly changes underneath. This is the dangerous interaction. If you’ve built a model router (and I argue your users will never pick the right model, so you should build one: https://medium.com/towards-artificial-intelligence/your-users-will-never-pick-the-right-model-build-a-router-instead-2ee39a86f702), then your fallback path is routing real traffic to a different model than you think. Track your quality metrics per serving path, not just in aggregate, or your router will hide a regression behind a healthy uptime number.

Guardrail and refusal rates. A spike in either means user behavior drifted, someone’s probing you with an injection attempt, or a model change moved the refusal boundary. All three are worth knowing about, and once your guardrail actions are logged properly, this is nearly free to track.

This is the group that single LLM calls don’t have, and it’s where my own obsession lives. An agent makes a sequence of decisions, and each one can quietly degrade. The metrics here all come from one source: the trajectory log. If you’re not logging the full trajectory of every agent run, start there before anything else on this list.

Tool-call error rate, split by cause. This split is the whole point. A schema or argument error means the model misunderstood the tool; the fix is in the tool description, which, as I never stop saying, is part of your prompt. An execution error means the tool itself broke; the fix is in your code. Same symptom, completely different bug. Lumping them together throws away the only information that tells you where to look.

Steps and tokens per completed task. Watch this religiously after every model swap and every prompt change. If success rate holds steady while steps-per-task climbs, your cost is rising while your accuracy isn’t. That’s a silent margin leak, and it’s the agentic version of the iteration-budget problem that strangled early Wasaphi: it would burn 80% of its iterations gathering data and have nothing left for the actual analysis.

Context window utilization. This is your early warning for compaction and truncation problems. Quality usually starts degrading while the context fills, well before anything visibly breaks, so a utilization trend lets you intervene while the failure is still invisible to users. Given how much I bang on about context engineering, it should be no surprise I treat this as a first-class metric.

Loop detection. The rate at which the agent repeats an identical tool call with identical arguments. A loop is pure token burn (a credit card spinning with nobody driving), and it’s trivially detectable from the trajectory log with a few lines of code.

Step back and look at the five groups together, and a pattern jumps out that reframes the whole problem.

Latency and reliability show up on day one. The gateway, the load balancer, and the provider SDK emit them as a side effect of just serving traffic. You get them for free.

Cost per task, correctness, and agent behavior are produced by nothing until you build the pipeline that measures them. Token and cache fields logged from every response. An eval set that runs on every change. A judge with calibration. Trajectory logs for agents.
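
As an illustration of how little code the trajectory-log metrics need, here is a rough sketch that derives three of them (loop rate, tool-call errors split by cause, and steps per completed task) from logged runs. The `ToolCall` and `Run` shapes and the `"schema"` / `"execution"` labels are assumptions made for this example, not the schema of any particular framework, and it assumes tool arguments are JSON-serializable.

```python
import json
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    """One step in a trajectory. Hypothetical shape, for illustration only."""
    tool: str
    args: dict
    error_kind: Optional[str] = None  # None, "schema", or "execution"


@dataclass
class Run:
    """One agent run: its ordered tool calls and whether the task completed."""
    calls: list = field(default_factory=list)
    completed: bool = False


def repeated_calls(calls: list) -> int:
    """Count calls that exactly repeat an earlier (tool, arguments) pair."""
    # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal.
    seen = Counter((c.tool, json.dumps(c.args, sort_keys=True)) for c in calls)
    return sum(n - 1 for n in seen.values())


def loop_rate(runs: list) -> float:
    """Fraction of runs containing at least one identical repeated call."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if repeated_calls(r.calls) > 0) / len(runs)


def errors_by_cause(runs: list) -> Counter:
    """Tool-call errors split by cause, so each kind points at its own fix."""
    return Counter(
        c.error_kind for r in runs for c in r.calls if c.error_kind is not None
    )


def steps_per_completed_task(runs: list) -> float:
    """Mean number of tool calls across runs that completed their task."""
    done = [len(r.calls) for r in runs if r.completed]
    return sum(done) / len(done) if done else float("nan")
```

An identical repeat is only a proxy for a loop: a tool that is legitimately polled would need to be excluded, and near-identical calls (same tool, trivially different arguments) slip through. Tracking tokens per completed task works the same way once each step carries its token counts.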
Here’s why thorough-looking monitoring and giant blind spots coexist so comfortably: the monitoring covers the dimensions the infrastructure happens to report, and the failures live in the dimensions it doesn’t. Your dashboard is honest. It’s just measuring the wrong things.

The practical consequence is a budgeting decision, not a tooling one. Instrumentation for cost-per-task, quality, and agent behavior belongs in the initial build estimate for an AI feature, the same way you’d budget for auth or error handling. It is not a “we’ll add observability in a later sprint” line item, because by the time you reach that sprint you’ll have been flying blind through the exact failures that matter most.

You don’t implement all of this at once; that’s a great way to never ship. Ordered by information-per-effort:

This list is representative, not exhaustive. It deliberately skips the GPU-level serving metrics that only matter if you host your own inference, plus embedding drift and business-layer numbers like containment rate. Different article.

Here’s the part that makes all of this less daunting than it looks. If you built your agent the way I keep arguing you should (a probabilistic core inside a deterministic shell), then the shell already controls everything you need to measure. It assembles the context, so it knows the prompt length and the cache prefix. It routes the model, so it knows which serving path ran. It validates and executes tool calls, so it has the trajectory. It manages the context window, so it knows the utilization.

Every number in this article passes through the shell already. Monitoring isn’t a separate system you bolt on. It’s the shell logging what it’s already handling.

You wrapped the model in deterministic code to make its brilliance safe. Logging what crosses that boundary is the same act, viewed from the other side. The shell defines the arena. Observability is how you watch the performance, and notice, before your users do, when the core has quietly stopped performing.

Your dashboard can keep saying 200 OK. You’ll know better.

Thanks for reading

I’m Elliott, a Python & Agentic AI consultant and entrepreneur who builds practical AI tools and shares what actually works in production, usually after learning the expensive way so you don’t have to. I write weekly about the agents I build, the architecture behind them, and the patterns that survive contact with real users.

If this gave you a metric you weren’t tracking, smash that clap button 👏 and follow for more honest takes on agent architecture. And if your dashboard is green right now, when’s the last time you checked cost per successful task? Tell me in the comments.

You Can’t Monitor an AI Agent Like a Web Service. Here’s What I Track Instead. (https://pub.towardsai.net/you-cant-monitor-an-ai-agent-like-a-web-service-here-s-what-i-track-instead-4746d5910345) was originally published in Towards AI (https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.