We burned 136 million tokens running an autonomous agent studio. Here's how we cut the bill ~90%.

A developer running an autonomous agent studio burned 136 million tokens due to a self-looping agent on a timer that re-sent an ever-growing session uncached. By rebuilding the runtime to use cheap models for mechanical steps, deterministic verification, and spend caps, the team cut costs by approximately 90%.

We run a studio where AI agents work mostly unattended — they write code, ship sites, produce content, and keep going without a human in the loop. Running agents like that, around the clock, teaches you one thing fast: the bill is the product constraint. Not the model's intelligence. The bill. Here's the most expensive lesson we paid for, and the architecture we rebuilt to stop paying it. One of our agents burned ~136 million tokens in a stretch where it produced almost nothing. We assumed runaway tool calls. It wasn't. The cause was mundane and brutal: the agent was waking itself on a timer a cron / scheduled self-invoke into one ever-growing session. Two things compounded: The whole thread is re-sent every turn. An LLM is stateless. A long agent session doesn't "remember" — the entire conversation is re-uploaded as input on every single call. A session that grows to 800k tokens of context costs ~800k input tokens per turn , even if the model writes two sentences. The prompt cache expires. Providers cache your context so re-sends are cheap — but the cache has a short TTL minutes . Any timer that fires slower than the TTL means every wake-up re-reads the full context uncached , at roughly 10× the cached price. So: a self-looping agent, on a timer longer than the cache window, re-sending an ever-larger thread, uncached, forever. That's how you turn "a few cents of output" into 136M tokens. The reflex fix is "use less / set a token limit." That caps the damage; it doesn't change the economics. You're still running every step on a frontier model, still re-sending context, still paying frontier prices for work a much cheaper model could do. The real fix was to stop treating the expensive model as the default. We rebuilt the runtime around four cost-native principles. None of them are exotic — but no agent framework ships them as the default, because most of the ecosystem makes money when your usage goes up , not down. A frontier model running a recurring loop in one session is the single most expensive pattern in agent ops. We banned it. Recurring, autonomous work runs off the frontier model entirely — a cheap planner decomposes the goal, a cheap/local worker executes, and a deterministic check verifies. The frontier model is summoned only when a genuine human-level judgment call is needed, and then in a fresh, lean session. This is the lever almost nobody defaults to, and it's the biggest one. Most steps in an agent loop are mechanical: read a file, run a command, reformat output, check a condition. You do not need a $15/M-token model for that. You need a $0.14/M-token model — or a local one running at ~$0 marginal cost. We route by step: Reported savings for this pattern across the industry land at 60–86% . Our own bill dropped about an order of magnitude. The quality cost is near zero if you add the next piece. The fear with cheap/local models is quality. The answer isn't "trust the cheap model" — it's " let the cheap model do the work, then verify it with something that can't lie. " A test suite. A linter. A schema check. An exit code. If the cheap model's output passes a deterministic gate, it's correct by construction and you never paid frontier prices to find out. If it fails, you retry or escalate. The verify-gate is what makes aggressive downshifting safe. Every agent runs under a spend cap. Cross it and the agent defers, it doesn't barrel on. And we attribute spend per agent — so when something costs too much, we know exactly which one and why, instead of staring at one big number at the end of the month. The 136M fire was invisible precisely because nothing attributed cost to the loop while it ran. The context-resend problem has a second fix beyond caching: keep sessions short and lean, and put continuity in durable files, not in one infinite thread. Our agents write their state, decisions, and memory to disk. A fresh session reads a small digest of what matters instead of re-uploading a 500k-token history. Short threads are cheap threads. If routing-to-cheap saves 60–90%, why doesn't every agent framework do it out of the box? Incentives. The big agent frameworks and observability tools monetize usage, seats, and traces — they grow when your token count grows. The model providers obviously don't profit from you spending less with them. So the most valuable cost lever in agent engineering is the one nobody in the value chain is motivated to ship as a default. It gets left to you. That's the whole reason we open-sourced our runtime. You cannot run a real agent network on a frontier model alone. It will cost you your margin. The architecture that works: We learned it by setting 136 million tokens on fire. You don't have to. We're building this in the open — a cost-native, brain-agnostic agent runtime run any agent on a local model, a cheap API, or a frontier model; same code, same isolation . If you want the runtime or the deeper postmortems, follow along.