Your Agent Logs Are Lying to You: What to Actually Trace in an Agentic System

An engineer has identified a critical gap in observability for AI agents: standard application logs fail to capture the model's decision-making process. The engineer proposes a trace-based system that records each step of an agent's reasoning, including model invocations and tool calls, to enable debugging of incorrect behavior. A minimal TypeScript tracer implementation is provided to illustrate the approach.

Here is a debugging session I have watched play out at four different companies now. An agent does something dumb in production. A user complains. An engineer opens the logs. They find this:

    INFO agent.run started
    INFO calling tool: search
    INFO calling tool: fetch document
    INFO agent.run completed in 14.2s

And that is it. That is everything. The agent burned 14 seconds, made three model calls, fetched the wrong document, and confidently told the user something false, and the logs have nothing to say about why. The engineer shrugs, marks the ticket "could not reproduce," and moves on. The bug ships forever.

The problem is not that they forgot to log. They logged plenty. The problem is they logged the wrong layer. Application logs are a record of what your code did. An agent's behavior does not live in your code. It lives in the gap between your code and the model's decisions. That gap is invisible to console.log.

In a normal service, the interesting events are deterministic. A request comes in, you branch on some conditions, you hit a database, you return a response. If you log the branches and the query, you can reconstruct what happened. The control flow is the explanation.

Agents invert this. Your control flow is trivial: usually a while loop that calls the model, executes whatever tool the model asked for, and feeds the result back. All of the actual decision-making happens inside the model, expressed as tokens you never wrote.
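
To make that concrete, here is a rough sketch of such a loop. Everything in it is hypothetical and for illustration only: the `Message` and `ModelReply` shapes, `runAgent`, and the scripted stand-in model are my assumptions, not any real SDK's API, and the loop is synchronous to keep it short where a real model call would be async.

```typescript
// Hypothetical shapes for illustration; not any specific SDK's API.
type Message = { role: "user" | "assistant" | "tool"; content: string };
type ModelReply =
  | { kind: "tool_call"; tool: string; args: string }
  | { kind: "final"; text: string };
type Model = (messages: Message[]) => ModelReply;
type Tools = Record<string, (args: string) => string>;

function runAgent(model: Model, tools: Tools, userInput: string, maxTurns = 8) {
  const messages: Message[] = [{ role: "user", content: userInput }];
  const log: string[] = ["agent.run started"];

  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = model(messages); // every real decision happens in here
    if (reply.kind === "final") {
      log.push("agent.run completed");
      return { answer: reply.text, log, messages };
    }
    log.push(`calling tool: ${reply.tool}`); // all a typical log line records
    const tool = tools[reply.tool];
    const result = tool ? tool(reply.args) : `unknown tool: ${reply.tool}`;
    messages.push({ role: "assistant", content: `${reply.tool}(${reply.args})` });
    messages.push({ role: "tool", content: result }); // re-enters the context
  }
  log.push("agent.run stopped: turn limit reached");
  return { answer: null, log, messages };
}

// Scripted stand-in for a model: search once, then answer from whatever came back.
const scriptedModel: Model = (messages) => {
  const last = messages[messages.length - 1];
  return last?.role === "tool"
    ? { kind: "final", text: `Based on: ${last.content}` }
    : { kind: "tool_call", tool: "search", args: "refund policy" };
};

const run = runAgent(
  scriptedModel,
  { search: () => "stale 2019 policy page" },
  "What is the refund policy?",
);
```

In this toy run, `run.log` holds three tidy lines and never mentions that the search returned a stale page. The only place that fact survives is `run.messages`, which is exactly the content a conventional log throws away.
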
When the agent goes wrong, the answer is never "the loop had a bug." The answer is in the content: what was in the context window, what the model chose, what the tool returned, how the model interpreted that return.

So the unit of observability for an agent is not the log line. It is the step: one full turn of perceive, decide, act. And steps nest: a sub-agent's steps live inside a parent step, a tool call may itself trigger a model call. You need a tree, not a stream. This is exactly the trace-and-span model from distributed tracing, and it maps onto agents shockingly well.

For every model invocation, you want the things that let you replay the decision without rerunning it. At minimum:

For every tool call: the arguments the model produced, the result you returned to it, whether it errored, and how long it took. The tool result is the single most overlooked field, because that text re-enters the context and steers everything after it. Garbage in a tool result is the most common root cause of a confidently wrong final answer, and it is invisible unless you store it.

Here is a minimal tracer in TypeScript. The shape matters more than the implementation:

    type StepKind = "model" | "tool";

    interface Step {
      id: string;
      parentId: string | null;
      kind: StepKind;
      name: string;
      input: unknown;   // resolved messages or tool args
      output: unknown;  // raw completion or tool result
      startedAt: number;
      endedAt?: number;
      tokensIn?: number;
      tokensOut?: number;
      error?: string;
      meta: Record