Your Agent Logs Are Lying to You: What to Actually Trace in an Agentic System

An engineer has identified a critical gap in observability for AI agents: standard application logs fail to capture the model's decision-making process. The engineer proposes a trace-based system that records each step of an agent's reasoning, including model invocations and tool calls, to enable debugging of incorrect behavior. A minimal TypeScript tracer implementation is provided to illustrate the approach.

Here is a debugging session I have watched play out at four different companies now. An agent does something dumb in production. A user complains. An engineer opens the logs. They find this:

    INFO agent.run started
    INFO calling tool: search
    INFO calling tool: fetch_document
    INFO agent.run completed in 14.2s

And that is it. That is everything. The agent burned 14 seconds, made three model calls, fetched the wrong document, and confidently told the user something false, and the logs have nothing to say about why. The engineer shrugs, marks the ticket "could not reproduce," and moves on. The bug ships forever.

The problem is not that they forgot to log. They logged plenty. The problem is they logged the wrong layer. Application logs are a record of what your code did. An agent's behavior does not live in your code; it lives in the gap between your code and the model's decisions. That gap is invisible to console.log.

In a normal service, the interesting events are deterministic. A request comes in, you branch on some conditions, you hit a database, you return a response. If you log the branches and the query, you can reconstruct what happened. The control flow is the explanation.

Agents invert this. Your control flow is trivial: usually a while loop that calls the model, executes whatever tool the model asked for, and feeds the result back. All of the actual decision-making happens inside the model, expressed as tokens you never wrote. When the agent goes wrong, the answer is never "the loop had a bug." The answer is in the content: what was in the context window, what the model chose, what the tool returned, how the model interpreted that return.

So the unit of observability for an agent is not the log line. It is the step: one full turn of perceive, decide, act. And steps nest. A sub-agent's steps live inside a parent step, and a tool call may itself trigger a model call. You need a tree, not a stream.
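
To make the "trivial control flow" claim concrete, here is a minimal sketch of such a loop. Everything in it (the `Message` and `ModelReply` shapes, `ModelFn`, the `tools` map, `runAgent`) is a hypothetical stand-in rather than any particular SDK's API, and the loop is bounded by `maxTurns` only so that a confused model cannot spin forever.

```typescript
// Hypothetical message and reply shapes; real SDKs differ.
type Message = { role: "user" | "assistant" | "tool"; content: string };
type ModelReply =
  | { type: "tool_call"; tool: string; args: string }
  | { type: "final"; text: string };
type ModelFn = (messages: Message[]) => Promise<ModelReply>;
type ToolFn = (args: string) => Promise<string>;

async function runAgent(
  model: ModelFn,
  tools: Map<string, ToolFn>,
  question: string,
  maxTurns = 8,
): Promise<{ answer: string; messages: Message[] }> {
  const messages: Message[] = [{ role: "user", content: question }];
  for (let turn = 0; turn < maxTurns; turn++) {
    // Decide: the model sees the whole context and picks the next move.
    const reply = await model(messages);
    if (reply.type === "final") {
      messages.push({ role: "assistant", content: reply.text });
      return { answer: reply.text, messages };
    }
    // Act: run the requested tool, or report that it does not exist.
    const tool = tools.get(reply.tool);
    const result = tool ? await tool(reply.args) : `unknown tool: ${reply.tool}`;
    // Perceive: the result, good or garbage, goes straight back into the context.
    messages.push({ role: "assistant", content: `call ${reply.tool}(${reply.args})` });
    messages.push({ role: "tool", content: result });
  }
  throw new Error("agent did not finish within maxTurns");
}
```

Nothing in this loop can explain a wrong answer. The explanation sits in `messages` and in what the model made of them at each turn, which is why each turn needs to be recorded as a step with a pointer to its parent.
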
This is exactly the trace-and-span model from distributed tracing, and it maps onto agents shockingly well.

For every model invocation, you want the things that let you replay the decision without rerunning it. At minimum, that is what the Step type below records: the resolved input messages, the raw completion, token counts in and out, start and end timestamps, and model settings such as the model name and temperature.

For every tool call: the arguments the model produced, the result you returned to it, whether it errored, and how long it took. The tool result is the single most overlooked field, because that text re-enters the context and steers everything after it. Garbage in a tool result is the most common root cause of a confidently wrong final answer, and it is invisible unless you store it.

Here is a minimal tracer in TypeScript. The shape matters more than the implementation:

    type StepKind = "model" | "tool";

    interface Step {
      id: string;
      parentId: string | null;
      kind: StepKind;
      name: string;
      input: unknown; // resolved messages or tool args
      output: unknown; // raw completion or tool result
      startedAt: number;
      endedAt?: number;
      tokensIn?: number;
      tokensOut?: number;
      error?: string;
      meta: Record<string, unknown>; // model, temperature, etc.
    }

    class Trace {
      readonly steps: Step[] = [];
      private stack: string[] = []; // ids of the steps currently open

      begin(
        kind: StepKind,
        name: string,
        input: unknown,
        meta: Record<string, unknown> = {},
      ): string {
        const id = crypto.randomUUID();
        this.steps.push({
          id,
          parentId: this.stack.at(-1) ?? null, // innermost open step, if any
          kind,
          name,
          input,
          output: undefined,
          startedAt: Date.now(),
          meta,
        });
        this.stack.push(id);
        return id;
      }

      end(id: string, patch: Partial<Step>): void {
        const step = this.steps.find((s) => s.id === id);
        if (step) Object.assign(step, patch, { endedAt: Date.now() });
        if (this.stack.at(-1) === id) this.stack.pop();
      }
    }

The parentId plus the stack is the whole trick. You get a tree for free, and a sub-agent just pushes more steps onto the same trace.

Wrap your model client and your tool dispatcher so this happens automatically. If instrumenting requires discipline at every call site, it will rot within a month.
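
On the tool side, a wrapper might look something like the sketch below. The `StepRecorder` interface is only a structural view of the `Trace` class above so the block type-checks on its own; `Tool`, the `tools` map, and `tracedToolCall` are hypothetical names, not part of any library.

```typescript
// Structural view of the Trace class above, so this sketch type-checks on its own.
interface StepRecorder {
  begin(kind: "model" | "tool", name: string, input: unknown): string;
  end(id: string, patch: { output?: unknown; error?: string }): void;
}

// Hypothetical tool signature: parsed arguments in, result out.
type Tool = (args: unknown) => Promise<unknown>;

async function tracedToolCall(
  trace: StepRecorder,
  tools: Map<string, Tool>,
  name: string,
  args: unknown,
): Promise<unknown> {
  // Record the arguments exactly as the model produced them.
  const id = trace.begin("tool", name, args);
  try {
    const tool = tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    const result = await tool(args);
    // Store the full result: this is the text that re-enters the context.
    trace.end(id, { output: result });
    return result;
  } catch (err) {
    // A failed call still closes its step, so the tree stays consistent.
    trace.end(id, { error: String(err) });
    throw err;
  }
}
```

Because `end` runs on both the success and the error path, the step always closes and timing is captured either way. The model-side wrapper has the same shape.
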

    // `client` and `Message` come from whatever model SDK you use.
    async function tracedModelCall(trace: Trace, messages: Message[], model: string) {
      const id = trace.begin("model", model, messages, { model });
      try {
        const res = await client.chat({ model, messages });
        trace.end(id, {
          output: res,
          tokensIn: res.usage.prompt_tokens,
          tokensOut: res.usage.completion_tokens,
        });
        return res;
      } catch (err) {
        // Close the step on failure too, then let the caller handle the error.
        trace.end(id, { error: String(err) });
        throw err;
      }
    }

Capturing the trace is half the job. The half that actually pays off is being able to ask questions across traces. "Show me every run where a tool returned an empty result and the final answer still claimed success." "Which model version started producing 3x the tool calls last Tuesday?" "What did the context window look like for the five worst-rated responses this week?"

None of those are answerable from a log file. They require treating each trace as structured, queryable data, which means a real schema, indexed fields, and ideally a way to attach evaluation scores and user feedback onto the same trace. The moment you can join "this trace failed our eval" to "here is the exact resolved input that caused it," debugging stops being archaeology and becomes a query.

This is also where observability and evaluation stop being separate concerns. An eval failure is just a trace with a verdict attached. A production incident is a trace with a bad outcome. They are the same object viewed from two directions, and the teams who treat them as one thing move dramatically faster.

If you build one thing this quarter for your agents, build the trace tree. Not more INFO lines, but a structured, nested record of every model and tool step, with the resolved inputs and raw outputs intact, that you can query and score after the fact. Everything else in agent reliability gets easier once you can actually see what happened.
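
As a concrete version of the first question above (a tool returned an empty result and the final answer still claimed success), here is a sketch of that query as an in-memory filter. The `RunRow` shape, the `claimedSuccess` flag, and the definition of "empty" are all assumptions made for illustration; a real system would more likely run this against an indexed store.

```typescript
// Minimal step shape, mirroring a few fields of the Step interface in this article.
interface StepRow {
  kind: "model" | "tool";
  name: string;
  output: unknown;
  error?: string;
}

// Hypothetical stored run: its steps plus a verdict attached after the fact.
interface RunRow {
  traceId: string;
  steps: StepRow[];
  // Assumption: set by an eval or a heuristic applied to the final answer.
  claimedSuccess: boolean;
}

// Treat null/undefined, blank strings, and empty arrays as "empty" tool results.
function isEmptyResult(output: unknown): boolean {
  if (output === null || output === undefined) return true;
  if (typeof output === "string") return output.trim() === "";
  if (Array.isArray(output)) return output.length === 0;
  return false;
}

// Runs where some tool came back empty without erroring,
// yet the final answer still claimed success.
function emptyToolButSuccess(runs: RunRow[]): RunRow[] {
  return runs.filter(
    (run) =>
      run.claimedSuccess &&
      run.steps.some((s) => s.kind === "tool" && !s.error && isEmptyResult(s.output)),
  );
}
```

Once steps are stored as data rather than text, a question like this is a filter, not a grep through log files.
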
This is the philosophy behind the tooling I work on: agent-eval (https://github.com/) for turning those traces into pass/fail verdicts in CI, and AgentLens for keeping the same traces searchable once the agent is live in production. Whether you adopt those or roll your own, the principle holds: your agent's behavior lives in the steps, so that is what you have to capture. Log the decisions, not the function calls.