{"slug": "your-agent-logs-are-lying-to-you-what-to-actually-trace-in-an-agentic-system", "title": "Your Agent Logs Are Lying to You: What to Actually Trace in an Agentic System", "summary": "An engineer has identified a critical gap in observability for AI agents: standard application logs fail to capture the model's decision-making process. The engineer proposes a trace-based system that records each step of an agent's reasoning, including model invocations and tool calls, to enable debugging of incorrect behavior. A minimal TypeScript tracer implementation is provided to illustrate the approach.", "body_md": "Here is a debugging session I have watched play out at four different companies now.\n\nAn agent does something dumb in production. A user complains. An engineer opens the logs. They find this:\n\n```\n[INFO] agent.run started\n[INFO] calling tool: search\n[INFO] calling tool: fetch_document\n[INFO] agent.run completed in 14.2s\n```\n\nAnd that is it. That is everything. The agent burned 14 seconds, made three model calls, fetched the wrong document, and confidently told the user something false — and the logs have nothing to say about *why*. The engineer shrugs, marks the ticket \"could not reproduce,\" and moves on. The bug ships forever.\n\nThe problem is not that they forgot to log. They logged plenty. The problem is they logged the wrong layer. Application logs are a record of what your *code* did. An agent's behavior does not live in your code — it lives in the gap between your code and the model's decisions. That gap is invisible to `console.log`.\n\nIn a normal service, the interesting events are deterministic. A request comes in, you branch on some conditions, you hit a database, you return a response. If you log the branches and the query, you can reconstruct what happened. The control flow *is* the explanation.\n\nAgents invert this. 
Your control flow is trivial — usually a `while` loop that calls the model, executes whatever tool the model asked for, and feeds the result back. All of the actual decision-making happens inside the model, expressed as tokens you never wrote. When the agent goes wrong, the answer is never \"the loop had a bug.\" The answer is in the *content*: what was in the context window, what the model chose, what the tool returned, how the model interpreted that return.\n\nSo the unit of observability for an agent is not the log line. It is the **step**: one full turn of perceive, decide, act. And steps nest — a sub-agent's steps live inside a parent step, a tool call may itself trigger a model call. You need a tree, not a stream. This is exactly the trace-and-span model from distributed tracing, and it maps onto agents shockingly well.\n\nFor every model invocation, you want the things that let you replay the decision without rerunning it. At minimum:\n\nFor every tool call: the arguments the model produced, the result you returned to it, whether it errored, and how long it took. The tool *result* is the single most overlooked field, because that text re-enters the context and steers everything after it. Garbage in a tool result is the most common root cause of a confidently wrong final answer, and it is invisible unless you store it.\n\nHere is a minimal tracer in TypeScript. 
The shape matters more than the implementation:\n\n```\ntype StepKind = \"model\" | \"tool\";\n\ninterface Step {\n  id: string;\n  parentId: string | null;\n  kind: StepKind;\n  name: string;\n  input: unknown;        // resolved messages or tool args\n  output: unknown;       // raw completion or tool result\n  startedAt: number;\n  endedAt?: number;\n  tokensIn?: number;\n  tokensOut?: number;\n  error?: string;\n  meta: Record<string, unknown>; // model, temperature, etc.\n}\n\nclass Trace {\n  readonly steps: Step[] = [];\n  private stack: string[] = [];\n\n  begin(kind: StepKind, name: string, input: unknown, meta = {}): string {\n    const id = crypto.randomUUID();\n    this.steps.push({\n      id,\n      parentId: this.stack.at(-1) ?? null,\n      kind, name, input,\n      output: undefined,\n      startedAt: Date.now(),\n      meta,\n    });\n    this.stack.push(id);\n    return id;\n  }\n\n  end(id: string, patch: Partial<Step>): void {\n    const step = this.steps.find((s) => s.id === id);\n    if (step) Object.assign(step, patch, { endedAt: Date.now() });\n    if (this.stack.at(-1) === id) this.stack.pop();\n  }\n}\n```\n\nThe `parentId` plus the stack is the whole trick. You get a tree for free, and a sub-agent just pushes more steps onto the same trace. Wrap your model client and your tool dispatcher so this happens automatically — if instrumenting requires discipline at every call site, it will rot within a month.\n\n```\nasync function tracedModelCall(trace: Trace, messages: Message[], model: string) {\n  const id = trace.begin(\"model\", model, messages, { model });\n  try {\n    const res = await client.chat({ model, messages });\n    trace.end(id, {\n      output: res,\n      tokensIn: res.usage.prompt_tokens,\n      tokensOut: res.usage.completion_tokens,\n    });\n    return res;\n  } catch (err) {\n    trace.end(id, { error: String(err) });\n    throw err;\n  }\n}\n```\n\nCapturing the trace is half the job. 
The half that actually pays off is being able to *ask questions across traces*. \"Show me every run where a tool returned an empty result and the final answer still claimed success.\" \"Which model version started producing 3x the tool calls last Tuesday?\" \"What did the context window look like for the five worst-rated responses this week?\"\n\nNone of those are answerable from a log file. They require treating each trace as structured, queryable data — which means a real schema, indexed fields, and ideally a way to attach evaluation scores and user feedback onto the same trace. The moment you can join \"this trace failed our eval\" to \"here is the exact resolved input that caused it,\" debugging stops being archaeology and becomes a query.\n\nThis is also where observability and evaluation stop being separate concerns. An eval failure is just a trace with a verdict attached. A production incident is a trace with a bad outcome. They are the same object viewed from two directions, and the teams who treat them as one thing move dramatically faster.\n\nIf you build one thing this quarter for your agents, build the trace tree. Not more `INFO` lines — a structured, nested record of every model and tool step, with the resolved inputs and raw outputs intact, that you can query and score after the fact. Everything else in agent reliability gets easier once you can actually *see* what happened.\n\nThis is the philosophy behind the tooling I work on: [agent-eval](https://github.com/) for turning those traces into pass/fail verdicts in CI, and AgentLens for keeping the same traces searchable once the agent is live in production. Whether you adopt those or roll your own, the principle holds — your agent's behavior lives in the steps, so that is what you have to capture. 
Log the decisions, not the function calls.", "url": "https://wpnews.pro/news/your-agent-logs-are-lying-to-you-what-to-actually-trace-in-an-agentic-system", "canonical_source": "https://dev.to/saurav_bhattacharya/your-agent-logs-are-lying-to-you-what-to-actually-trace-in-an-agentic-system-k8o", "published_at": "2026-06-13 04:49:23+00:00", "updated_at": "2026-06-13 05:17:26.025756+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "large-language-models", "mlops"], "entities": ["TypeScript"], "alternates": {"html": "https://wpnews.pro/news/your-agent-logs-are-lying-to-you-what-to-actually-trace-in-an-agentic-system", "markdown": "https://wpnews.pro/news/your-agent-logs-are-lying-to-you-what-to-actually-trace-in-an-agentic-system.md", "text": "https://wpnews.pro/news/your-agent-logs-are-lying-to-you-what-to-actually-trace-in-an-agentic-system.txt", "jsonld": "https://wpnews.pro/news/your-agent-logs-are-lying-to-you-what-to-actually-trace-in-an-agentic-system.jsonld"}}