A 3-step agent cost me $4.20. agenttrace showed me the O(n²) tool call hiding in plain sight.

A simple three-step AI agent unexpectedly cost $4.20 because of a hidden bug in the cite-check step: the model made nine calls instead of one, and each iteration re-attached the full prior history, so total input tokens grew quadratically. The author used a Rust crate called `agenttrace-rs` to aggregate LLM calls into runs and generate a by-step cost breakdown, which revealed the issue. After replacing the full-history re-attach with a sliding window, the same run cost only $0.14, about 30 times cheaper.

I ran a small agent. Three steps. One web search, one summarize, one cite-check. I had budgeted maybe 12 cents. The bill at the end of the run was $4.20.

I knew something was off but the per-call invoice line items were not telling me anything useful. They were just a list of `messages.create` calls. I needed to group them into the run that produced them and look at the cost shape.

That is the gap `agenttrace-rs` fills. It is a Rust crate that aggregates LLM calls into runs and gives you cost, latency, and a by-model breakdown.

## The breakdown that surfaced the bug

```rust
use agenttrace::{Trace, Run};

let mut trace = Trace::new();
let run = trace.start_run("cite-check-agent");

run.record_call(claude_cost::estimate(&req1, &resp1));
run.record_call(claude_cost::estimate(&req2, &resp2));
run.record_call(claude_cost::estimate(&req3, &resp3));
// ... and so on for every tool result/follow-up step

let summary = run.finish();
println!("{}", summary.report());
```

The report it printed for the $4.20 run:

```
run: cite-check-agent
duration: 38.4s
total_cost_usd: 4.2031
calls: 11
p50_latency_ms: 2710
p95_latency_ms: 4920

by-model:
  claude-opus-4-7: 9 calls  $4.1880  avg_input_tok: 18,420  avg_output_tok: 540
  claude-haiku-4:  2 calls  $0.0151  avg_input_tok: 1,200   avg_output_tok: 180

by-step:
  step_1_search:     1 call   $0.0184  1,800 in  220 out
  step_2_summarize:  1 call   $0.0312  3,100 in  280 out
  step_3_cite_check: 9 calls  $4.1535  avg 22,400 in  avg 510 out
```

Step 3 was supposed to be one call. It was nine. And the average input tokens were 22,400. That is the smoking gun.

## What was actually happening

The cite-check step had a tool the model could call to fetch a source URL. When the model called the tool, I appended the tool result to the messages list and re-called `messages.create`. Standard pattern.

What I missed: every iteration was re-attaching the full prior history, including the search results from step 1 and the summary from step 2. So call 4 had everything from calls 1-3 in its input.
Call 5 had everything from calls 1-4. And so on. Input tokens grew linearly per call, so total tokens grew quadratically over the step. The model kept calling the tool again because the prompt was structured ambiguously. So I had an unbounded loop hidden behind a 9-iteration tool dance. O(n²) input tokens for n iterations.

The fix was small. I stopped re-attaching the full history on each tool turn and used a sliding window. Re-ran the same run cold:

```
run: cite-check-agent
duration: 11.2s
total_cost_usd: 0.1432
calls: 5
p50_latency_ms: 2200
p95_latency_ms: 3050

by-model:
  claude-opus-4-7: 3 calls  $0.1290
  claude-haiku-4:  2 calls  $0.0142

by-step:
  step_1_search:     1 call   $0.0181
  step_2_summarize:  1 call   $0.0308
  step_3_cite_check: 3 calls  $0.0943
```

14 cents. About 30x cheaper. I would not have found the bug without the by-step grouping.

## What agenttrace actually does

```rust
use agenttrace::{Trace, Tag};

let mut trace = Trace::new();
let run = trace.start_run("my-agent");
run.tag("user_id", "u_8821");
run.tag("step", "search");

// for each LLM call
run.record(agenttrace::CallRecord {
    model: "claude-opus-4-7".into(),
    input_tokens: 1800,
    output_tokens: 220,
    cache_read_tokens: 0,
    cache_write_tokens: 0,
    latency_ms: 2710,
    cost_usd: 0.0184,
    tags: vec![Tag::step("search")],
});

let summary = run.finish();
trace.append(summary);

// serialize all runs
let json = serde_json::to_string(&trace.runs)?;
```

It is a thin aggregator. It does not call the API. It does not make pricing decisions. You feed it call records (typically computed from `claude-cost` or your own pricing function) and it composes them into a run with cost, p50/p95, and per-tag breakdowns.

## Why p95 matters more than mean

`avg_latency_ms` lies. A run with one slow call (the model thought for 12 seconds, the rest returned in 2) shows a mean of about 4 seconds. The p95 shows the actual tail. For agents this is the number that tells you whether your user-facing experience is going to feel snappy or laggy.
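
To make the mean-versus-tail gap concrete, here is a minimal standalone sketch using the numbers from the example above (one 12-second call, four 2-second calls). It assumes a nearest-rank percentile; the `percentile` and `mean` helpers are hypothetical and are not agenttrace's own implementation, which may compute percentiles differently.

```rust
/// Nearest-rank percentile over a slice of latencies.
/// Hypothetical helper for illustration, not agenttrace's API.
fn percentile(latencies_ms: &[u64], pct: f64) -> Option<u64> {
    if latencies_ms.is_empty() {
        return None;
    }
    let mut sorted = latencies_ms.to_vec();
    sorted.sort_unstable();
    // Nearest-rank: the smallest value with at least pct% of samples at or below it.
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Arithmetic mean of the latencies, or None for an empty slice.
fn mean(latencies_ms: &[u64]) -> Option<f64> {
    if latencies_ms.is_empty() {
        return None;
    }
    Some(latencies_ms.iter().sum::<u64>() as f64 / latencies_ms.len() as f64)
}

fn main() {
    // Illustrative run: one call where the model thought for 12 s, four that took 2 s.
    let latencies_ms: [u64; 5] = [2_000, 2_000, 12_000, 2_000, 2_000];

    println!("mean: {:?}", mean(&latencies_ms));
    println!("p50:  {:?}", percentile(&latencies_ms, 50.0));
    println!("p95:  {:?}", percentile(&latencies_ms, 95.0));
}
```

With these inputs the mean comes out at 4,000 ms, the p50 at 2,000 ms, and the p95 at 12,000 ms. The mean describes a latency no call actually had, while the p95 points straight at the slow one.
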
agenttrace exposes p50, p95, and p99 by default.

## Composing with other crates

- `claude-cost` for the per-call cost estimate (cache-aware).
- `cachebench` to see the cache hit ratio across the run.
- `llm-circuit-breaker` to short-circuit a run when an upstream is degraded, so you do not pay $4.20 to discover that.

A typical pipeline in our service looks like: `cachebench` records hit/miss → `claude-cost` computes cost given hits → `agenttrace` aggregates into a run summary.

## What this does not solve

- It does not store traces durably. `Trace` is in-memory. You serialize to disk or to a remote sink yourself. I do that with a one-line `serde_json::to_writer` to a sqlite blob.
- It does not visualize. There is no UI. You get JSON or text reports. If you want a flamegraph, pipe to your own viewer.
- It does not capture the request bodies. Pair with `agenttap` for that. agenttrace is the cost/latency layer, not the wire layer.
- The tagging system is flat. There is no nested-span model. If you need that, OpenTelemetry is the right tool and `otel-genai-bridge-rs` can translate between conventions.

The crate is about 600 lines of pure Rust. No async lock-in.

Repo: https://github.com/MukundaKatta/agenttrace-rs

crates.io: `agenttrace = { package = "agenttrace-rs", version = "0.1" }`

Part of a small Rust stack I publish for AI agent plumbing: cost, retry, breakers, repair, trace. Built piece by piece from real incidents.