{"slug": "your-ai-agent-said-done-here-s-how-i-found-out-it-actually-failed-three-hours", "title": "Your AI Agent Said 'Done.' Here's How I Found Out It Actually Failed Three Hours Later.", "summary": "An engineer discovered that their AI agent sent 47 incorrect pricing emails to customers, taking three hours before a human noticed. The agent logged 'Done' but failed silently, highlighting a production observability gap. The engineer proposes a minimal observability stack with trace IDs, tool call tracking, and runtime validation to catch such failures.", "body_md": "Last Tuesday my AI agent sent 47 incorrect pricing emails to active customers. It took three hours before a human noticed and flagged it. By then the damage was done — three customers had already replied, confused, one had escalated to support.\n\nThe agent had completed its task. It logged \"Done.\" It moved on. The failure happened silently, in a place the logs never looked.\n\nThis is the production observability gap nobody talks about in demos. We spend enormous effort making agents do more. We spend almost nothing making sure we know when they've failed.\n\nMost agent frameworks give you logs. What you need is observability — the ability to ask \"did the agent actually accomplish what I asked, and how do I know?\"\n\nThere's a structural reason this is hard. When a traditional service fails, you get an exception, a stack trace, a 500 error. When an AI agent fails, it usually fails silently — it produces a plausible but wrong output, or it uses the wrong tool, or it completes a task that was itself based on stale data. The error is in the output, not in the execution.\n\nHere is the pattern I see repeatedly:\n\n```\n# What most people ship\nresult = agent.run(task)\nlogger.info(f\"Agent completed: {result}\")\n```\n\nThat tells you the agent ran. 
It tells you nothing about whether the result is correct, whether the agent did what you expected, or whether it even worked on the right problem.\n\nThree hours of wasted time could have been avoided with a 15-minute observability layer.\n\nAfter shipping a half-dozen agent systems in production, I've settled on a minimal observability stack that covers the failure modes I actually encounter. It has four components.\n\nEvery agent session gets a UUID the moment it starts. Every log line, every tool call, every model response includes that ID. Without this, you cannot correlate what happened across a multi-step agent run.\n\n``` python\nimport contextvars\nimport json\nimport uuid\nfrom datetime import datetime\n\ntrace_id_var = contextvars.ContextVar(\"trace_id\", default=None)\n\nclass AgentSession:\n    def __init__(self, task_description: str):\n        # First 8 hex chars of a UUID4: easy to grep, but collisions become possible at high volume\n        self.trace_id = str(uuid.uuid4())[:8]\n        self.task = task_description\n        self.started_at = datetime.utcnow()\n        self.tool_calls = []\n        self.steps = []\n\n    def log(self, level: str, message: str, **kwargs):\n        # One JSON object per line, always tagged with this session's trace ID\n        print(json.dumps({\n            \"ts\": datetime.utcnow().isoformat(),\n            \"trace_id\": self.trace_id,\n            \"level\": level,\n            \"message\": message,\n            **kwargs\n        }))\n```\n\nPass this session object through every tool call. 
When something breaks, you grep for the trace ID and get the full execution history.\n\nThe most common failure mode isn't \"the agent crashed\" — it's \"the agent called the right tool with the wrong parameters\" or \"the agent called a reasonable tool but the output was garbage.\"\n\nCapture both the call and a structural validation of the response, without slowing down the agent:\n\n``` python\ndef tracked_tool_call(session: AgentSession, tool_name: str, params: dict, result):\n    session.tool_calls.append({\n        \"tool\": tool_name,\n        \"params\": params,\n        \"result_preview\": str(result)[:200],  # first 200 chars\n        \"timestamp\": datetime.utcnow().isoformat()\n    })\n\n    # Validate response shape if we know what to expect\n    if tool_name == \"send_email\":\n        if not isinstance(result, dict) or \"message_id\" not in result:\n            session.log(\"ERROR\", \"email_tool_returned_unexpected_shape\", \n                       trace_id=session.trace_id, result=str(result)[:100])\n\n    return result\n```\n\nThis catches the class of failure where the tool runs but returns something useless — without blocking the agent's execution.\n\nBefore the agent marks a task complete, run a set of assertions against the output. 
These are not unit tests; they are runtime checks against the actual result.\n\n``` python\ndef validate_completion(session: AgentSession, task: str, result) -> bool:\n    # Each check is a (name, zero-argument callable) pair evaluated against the live result\n    checks = [\n        (\"result_is_string\", lambda: isinstance(result, str) and len(result) > 10),\n        (\"no_obvious_hallucination_markers\", lambda: not any(\n            phrase in str(result).lower()\n            for phrase in [\"as an ai\", \"i cannot\", \"i'm sorry\", \"undefined\"]\n        )),\n    ]\n\n    results = []\n    for name, check_fn in checks:\n        try:\n            ok = check_fn()\n            results.append({\"check\": name, \"ok\": ok})\n            if not ok:\n                # session.log already attaches the trace ID\n                session.log(\"WARN\", \"completion_check_failed\", check=name)\n        except Exception as e:\n            # A check that raises counts as a failed check\n            results.append({\"check\": name, \"ok\": False, \"error\": str(e)})\n            session.log(\"ERROR\", \"completion_check_error\", check=name, error=str(e))\n\n    return all(r[\"ok\"] for r in results)\n```\n\nThese checks won't catch every failure. But they catch a class of silent failure that plain logging misses: a result that is empty, truncated, or a refusal dressed up as a completed task.\n\nEvery agent run gets a time budget and a token budget. When either is exceeded, the agent stops, even if it hasn't reached a conclusion. 
This prevents the \"agent runs for 20 minutes and outputs nothing useful\" failure mode.\n\n``` python\nimport concurrent.futures\nimport logging\n\nlogger = logging.getLogger(__name__)\n\nMAX_DURATION_SECONDS = 120\nMAX_TOKENS = 8000  # not enforced here; check it inside the agent loop, where token counts are visible\n\ndef run_with_budget(agent, task: str) -> str:\n    # Run the agent in a worker thread so the caller can stop waiting at the deadline\n    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)\n    future = executor.submit(agent.run, task)\n    try:\n        return future.result(timeout=MAX_DURATION_SECONDS)\n    except concurrent.futures.TimeoutError:\n        # Budget exceeded: fail loudly\n        logger.error(f\"Budget exceeded: no result after {MAX_DURATION_SECONDS}s\")\n        raise TimeoutError(\"Agent exceeded time budget\")\n    finally:\n        # Python threads cannot be killed; this only stops the caller from waiting on the worker\n        executor.shutdown(wait=False)\n```\n\nThe email incident taught me three things I now build into every agent from day one.\n\n**Fail loudly at decision points.** The email agent had a confidence threshold: below it, the agent was supposed to escalate to a human. The threshold existed only in the prompt. The model ignored it on Tuesday morning, probably because the query phrasing triggered a confident-but-wrong path. Prompts are not contracts. Hard-code critical business logic outside the model's control.\n\n**Correlate tool calls with outcomes.** The email tool logged that it sent 47 emails. It did not log *why* those were the 47 emails it chose. When I investigated, I found the agent had selected the customer list using a query that returned stale data. The tool worked perfectly. The data pipeline feeding it was broken. Without the trace context linking the tool call to the query that preceded it, I would have blamed the email tool.\n\n**Your eval suite will not catch this.** My agent had passed every eval test before deployment. The eval suite checked whether the agent could complete the task correctly when everything went right. It didn't check what happened when the upstream data was stale, when the model's confidence calibration was off, or when the agent encountered an ambiguous instruction. Production failure modes are not in your eval suite. 
You find them with observability.\n\nThe three hours I lost to that email incident cost more than the 15 minutes it would have taken to add the observability layer. That's the math I keep relearning.\n\nIf you're shipping AI agents in production and you're not logging trace IDs, validating tool call outputs, checking completion criteria, and budgeting execution time — you are running the same experiment I ran, and you'll learn the same lesson I learned. The agent will tell you it's done. It won't tell you if it failed.\n\n*What's your most painful production agent failure story? I'd love to hear what you learned — find me on DEV.to or drop a comment below.*", "url": "https://wpnews.pro/news/your-ai-agent-said-done-here-s-how-i-found-out-it-actually-failed-three-hours", "canonical_source": "https://dev.to/mrclaw207/your-ai-agent-said-done-heres-how-i-found-out-it-actually-failed-three-hours-later-5b74", "published_at": "2026-07-01 13:05:25+00:00", "updated_at": "2026-07-01 13:18:48.439698+00:00", "lang": "en", "topics": ["ai-agents", "ai-safety", "developer-tools"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/your-ai-agent-said-done-here-s-how-i-found-out-it-actually-failed-three-hours", "markdown": "https://wpnews.pro/news/your-ai-agent-said-done-here-s-how-i-found-out-it-actually-failed-three-hours.md", "text": "https://wpnews.pro/news/your-ai-agent-said-done-here-s-how-i-found-out-it-actually-failed-three-hours.txt", "jsonld": "https://wpnews.pro/news/your-ai-agent-said-done-here-s-how-i-found-out-it-actually-failed-three-hours.jsonld"}}