{"slug": "four-forensics-when-a-production-ai-agent-fails", "title": "Four forensics when a production AI agent fails", "summary": "The article describes a common failure pattern in production AI agents, where multiple distinct issues—such as degraded external dependencies, faulty validation gates, and cost spikes from specific user queries—compound to appear as a single \"broken agent\" problem. It outlines a forensic approach using traces to diagnose these failures, citing a real case where an LLM provider's quiet model update increased agent iterations and latency, while permissive validation thresholds and a few high-cost users further degraded performance. The recommended fixes include tightening prompts, adjusting validation thresholds, and implementing per-user cost caps.", "body_md": "A founder messaged us at 11pm on a Friday: \"Our agent is broken. Customers are complaining. My on-call engineer has no idea where to start. Can you help?\"\nThe agent was a customer support tool that had launched the previous Monday. By Friday evening, the company's support inbox had filled with users reporting that the AI was giving wrong answers, taking forever to respond, or just timing out. The engineering team was treating it as one big problem. It was actually four problems stacked on top of each other.\nThis is the failure pattern most production agent teams hit at some point. The symptoms compound, the team panics, and they start trying random fixes. Here is the forensics order Sapota walks through, and the four most common failure modes that account for the majority of post-launch incidents.\nBefore debugging anything else, look at the traces. If your agent is in production without traces, that is the first problem to solve, even mid-incident. 
Pull a request that is failing, look at the trace, and see where the time is being spent and what is failing.\nThe pattern we look for in the trace:\nIn the founder's case, the traces showed three different failure patterns appearing in the same week. The team had been treating them as one problem because the customer-facing symptom was the same: \"the AI is broken.\"\nThe most common production agent failure is an external dependency getting slower or less reliable. The agent itself is fine; the world around it changed.\nCommon culprits:\nThe diagnostic: check your tool latency and error metrics for the past week. If any tool's p95 latency is 2x what it was at launch, or its error rate is up more than 1%, that is the candidate.\nThe fix depends on the specific dependency. Rate limits: upgrade your tier or implement exponential backoff. Slow retrieval: tune the index or scale the database. API drift: update the integration. KB growth: re-tune chunking and retrieval parameters.\nIn the founder's case, the LLM provider had pushed a quiet model update on Wednesday. The new model interpreted the routing prompt slightly differently, causing the agent to loop more often before settling on an answer. Average iterations went from 2.3 to 4.1. Cost and latency both jumped. The fix was a tighter routing prompt with three more few-shot examples.\nThe opposite failure: a validation gate is supposed to be catching bad outputs, but it is not firing because the gate logic has a bug or the threshold is wrong.\nCommon patterns:\nThe diagnostic: look at a sample of bad responses customers reported. Trace what should have caught them. If a validation gate exists for that failure mode, check whether it actually fired.\nIn the founder's system, the faithfulness threshold was set at 0.7, which was permissive. We tightened it to 0.85, the rejection rate went from 2% to 9%, and the customer complaints about wrong answers dropped immediately. 
The \"rejected\" responses were replaced with honest \"I cannot find that in our knowledge base\" messages, which users preferred to wrong answers.\nProduction query distribution is different from test distribution. Specific query patterns can be much more expensive than the average, and a few of those can dominate the bill.\nThe pattern: a small fraction of users (often 1-5%) generate a large fraction of cost (often 30-60%), whether through legitimate complex queries, abuse, or input that triggers a degenerate code path in the agent.\nThe diagnostic: pull cost-per-user statistics for the last week. Sort descending. Look at the top 10 users. Are they sending normal queries? Or is one user looping their integration with bad inputs? Or is a specific query class (long documents, malformed input, multi-turn conversations deep into rare topics) eating budget?\nThe fixes vary:\nIn the founder's system, two users were sending repeated multi-paragraph product comparison requests, generating about 40% of the daily cost between them. We added a per-user daily cost cap and a length limit on inputs. Cost dropped 35% within 48 hours. Neither user complained because both were testing internal features and the cap was generous enough for normal use.\nThe hardest failure to detect: nothing is broken, there are no errors, latency is fine, and cost is normal. But the responses are getting worse. Customers complain, the team cannot reproduce the problem, and the dashboards all look green.\nCauses:\nThe diagnostic: run your eval pipeline against the current production agent. Compare against the score from launch. If the score has dropped, you have quality drift. If the score is the same but customers are complaining, your eval set has gone stale.\nThe fix: refresh the eval set. Sample 50-100 actual production queries, write expected answers for each, run the eval, and tune from there. Most teams refresh eval sets quarterly. 
Teams in fast-moving domains do it monthly.\nThe four-hour Friday-night triage:\nCustomer complaints stopped within 72 hours. The team's mood went from \"we built a broken thing\" to \"we built a thing that needs operational rigor we did not anticipate.\" That second framing is the one that produces a better product.\nThe founder's team had no playbook for \"the agent is broken in production.\" They were debugging in panic mode, which slowed every step. After the incident, we wrote a one-page runbook with the four failure modes, the diagnostic for each, and the most common fixes.\nSix weeks later, when a similar issue happened (a tool API outage), the on-call engineer worked through the runbook, identified the cause in 20 minutes, applied the documented fix, and was done in an hour. No panic, no escalation, no Friday-night call to a consultant.\nThis is what production agent operations looks like at maturity. Not \"nothing ever goes wrong\" but \"when things go wrong, the team has a known process to find the cause.\"\nIf your team launched an AI agent and the first few weeks have been more painful than expected, the right intervention is usually a forensic audit, not more development. Most launch issues are not new bugs in the agent code. They are operational gaps that surface only at production scale.\nSapota offers a one-week post-launch audit that walks through traces, validation, dependencies, and quality drift, identifies which of the four failure modes is responsible for which symptoms, and ships fixes plus a runbook for future incidents. We have done this for half a dozen B2B SaaS clients in the first three months after their AI launches.\nReach out via the AI engineering page with a description of what your agent does and what kind of failures you are seeing. 
The first conversation is usually the diagnostic.", "url": "https://wpnews.pro/news/four-forensics-when-a-production-ai-agent-fails", "canonical_source": "https://dev.to/sapotacorp/four-forensics-when-a-production-ai-agent-fails-4in2", "published_at": "2026-05-24 02:51:28+00:00", "updated_at": "2026-05-24 03:04:50.483121+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "developer-tools", "startups", "enterprise-software"], "entities": ["Sapota"], "alternates": {"html": "https://wpnews.pro/news/four-forensics-when-a-production-ai-agent-fails", "markdown": "https://wpnews.pro/news/four-forensics-when-a-production-ai-agent-fails.md", "text": "https://wpnews.pro/news/four-forensics-when-a-production-ai-agent-fails.txt", "jsonld": "https://wpnews.pro/news/four-forensics-when-a-production-ai-agent-fails.jsonld"}}