# I Cut My AI Agent's Token Bill by 62% in One Weekend. Here's the Receipts.

> Source: <https://dev.to/mrclaw207/i-cut-my-ai-agents-token-bill-by-62-in-one-weekend-heres-the-receipts-1fp1>
> Published: 2026-06-19 13:06:16+00:00

My agent spent $5.40 to do what a 200-line script does for free. Then I spent a weekend fixing it, and brought the same workflow down to $2.05 per run — a 62% drop with no measurable quality regression. This is the breakdown, with the actual prompt diffs and the benchmarks that mattered.

The agent I run most is a research-and-summarize loop. It searches the web, scrapes ~20 pages, drafts a structured summary, and writes a file. Sounds harmless. The bill said otherwise.

Three things were quietly hemorrhaging tokens:

A 2026 Stevens Institute analysis pegs unconstrained agents at $5–$8 per task. Mine was $5.40. Textbook.

The old pattern:

```
# BAD: pay for the whole page, then ask the model to find the bit you wanted
page = fetch(url)  # ~50,000 chars
response = llm(f"Summarize this page, focusing on {topic}:\n\n{page}")
```

The new pattern:

```
# GOOD: filter first, then send only what's relevant
page = fetch(url)
chunks = chunk(page, max_chars=4000)
relevant = [c for c in chunks if keyword_score(c, topic) > 0.3]
relevant = relevant[:5]  # hard cap
response = llm(f"Summarize these excerpts for {topic}:\n\n" + "\n---\n".join(relevant))
```

Token usage dropped from ~12,500 input tokens per page to ~3,200. Quality went up — fewer hallucinations, because the model wasn't drowning in noise.

Old system prompt: 1,180 tokens.

New: 440 tokens.

The win wasn't in what I added — it was in removing redundancy. Three things got deleted:

`web_search`

does. One short line is enough.I ran the same 50-task eval suite before and after. Output quality was statistically indistinguishable. The 740 tokens saved per call added up to about $180/month on my volume.

This was the biggest single win. I split my agent's steps into three tiers:

| Step | Old model | New model | Cost per call |
|---|---|---|---|
| Extract key facts from chunks | GPT-5.4 | Claude 4 Sonnet | $0.003 → $0.0008 |
| Draft structured summary | GPT-5.4 | GPT-5.4 | $0.018 (unchanged) |
| Quality check + rewrite | GPT-5.4 | Claude 4 Sonnet | $0.003 → $0.0008 |

The reasoning-tier model only touches the synthesis step. Everything else runs on a cheaper, faster model that's still good enough for extractive work.

Routing logic, in 15 lines:

``` python
def route(step):
    if step.requires_reasoning:
        return "gpt-5.4"      # synthesis, planning, judgment calls
    if step.requires_long_context:
        return "claude-4-sonnet"  # chunk summarization, fact extraction
    return "gpt-5-mini"         # formatting, light edits
```

I didn't trust my gut on quality. I ran a 50-task eval suite with three different rubrics:

Numbers, before vs. after:

| Metric | Before | After | Change |
|---|---|---|---|
| Cost per task | $5.40 | $2.05 | -62% |
| Median latency | 41s | 28s | -32% |
| Fact accuracy | 0.81 | 0.83 | +0.02 (noise) |
| Citation coverage | 67% | 89% | +22pp |
| User satisfaction | 0.74 | 0.78 | +0.04 |

Citation coverage went *up* because chunk-then-extract gives the model cleaner evidence to cite. Latency dropped because smaller models respond faster. Fact accuracy was a wash — which is what you want, because the whole point was to cut cost without hurting quality.

Three things, in order of ROI:

`{task_id, step, model, input_tokens, output_tokens, cost}`

per run is the highest-leverage observability you'll add this year.The reflex in 2026 is to reach for a bigger model when quality dips. Most of the time, the answer is a smaller model with a tighter context.

The agent didn't get smarter. The pipeline got more honest about what each step actually needs.

If you're running agents in production and you haven't looked at your per-step token breakdown in the last 30 days, that's where I'd start. The $847/month I'm saving came from one weekend and three files changed.
