# Multi-agent fleets burn ~15x the tokens — here's the budget layer the playbooks skip

> Source: <https://dev.to/ahmadammar/multi-agent-fleets-burn-15x-the-tokens-heres-the-budget-layer-the-playbooks-skip-4i9e>
> Published: 2026-06-29 19:41:26+00:00

Give ten agents a shared, metered tool (a paid search or research API where every call is real money) and you've handed all ten the same company credit card. Each one reasons "I'll just run a quick search." You find out the total on the invoice.

Anthropic's own multi-agent write-up clocks a fleet at roughly 15x the tokens of a single chat. That's the token pool. The paid external tools are the line item the orchestration playbooks skip, and it's the one that shows up in actual dollars. I run four production projects on Claude Code with no external agent framework, and this is the pattern that keeps that line item from surprising me.

You can't solve this by asking agents to be frugal. "Be mindful of the budget" in a prompt is a wish, and an LLM will sail past "max 8 searches" the moment the task still feels unfinished. If your budget lives in the prompt, you don't have a budget; you have a hope.

Cost governance is two layers, and conflating them is why prompt-level "budgets" fail:

**1. Enforcement is deterministic and lives below the model.** A hard counter in the harness owns the paid credential and denies every call after the Nth. The model can't argue with it, jailbreak it, or sneak "just one more search" past it. This is what actually bounds the dollars, and it's plain middleware, not an agent. A dumb metered proxy in front of the API does this fine.
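A minimal sketch of that hard counter, assuming a single-process harness. The names `MeteredGateway`, `BudgetExceeded`, and `fake_search` are hypothetical, and `fake_search` stands in for whatever paid API the fleet shares:

```python
import threading
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when the hard cap on paid calls has been reached."""


class MeteredGateway:
    """Hypothetical wrapper: the only code path that can reach the paid API."""

    def __init__(self, paid_call: Callable[[str], str], max_calls: int) -> None:
        self._paid_call = paid_call  # would hold the credential in a real harness
        self._max_calls = max_calls
        self._used = 0
        self._lock = threading.Lock()

    def search(self, query: str) -> str:
        # Check and increment inside one lock so the count stays exact
        # even when several agents call at once.
        with self._lock:
            if self._used >= self._max_calls:
                raise BudgetExceeded(f"cap of {self._max_calls} paid calls reached")
            self._used += 1
        return self._paid_call(query)

    @property
    def used(self) -> int:
        return self._used


def fake_search(query: str) -> str:
    # Stand-in for a real metered API: no network, no money.
    return f"results for {query!r}"


gateway = MeteredGateway(fake_search, max_calls=2)
gateway.search("first")
gateway.search("second")
try:
    gateway.search("third")
except BudgetExceeded as exc:
    print("denied:", exc)
```

The denial is an exception raised by ordinary code, so nothing the model writes in its reasoning can change the outcome. This version counts attempts, not successes: a failed paid call still uses a slot, which is the conservative choice when the provider may bill for it anyway.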

**2. Judgment is the half a proxy can't do.** A gateway can cap spend, but it can't decide whether *this task deserves to spend at all*. That decision is model-shaped, and it's where the agent earns its place.

So the division is clean: **the gateway enforces, the agent judges.**

The novel part isn't "centralize the credential": that's least-privilege, decades old, and centralizing alone doesn't cap anything (N callers can still demand N searches through one door). The leverage is pairing a deterministic cap with model-shaped judgment about *whether to spend at all*, with the gate defaulting to "no."
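One way to read "defaulting to no" is sketched below. It assumes the judgment arrives as a callable that returns a verdict string; `gated_search`, `judge`, and the `"APPROVE"` verdict are hypothetical names, and in practice `judge` would be a call to the budgeted agent:

```python
from typing import Callable, Optional


def gated_search(
    task: str,
    query: str,
    judge: Callable[[str, str], str],
    paid_search: Callable[[str], str],
) -> Optional[str]:
    """Spend only on an explicit approval; every other outcome means no."""
    try:
        verdict = judge(task, query).strip().upper()
    except Exception:
        verdict = "DENY"  # a failed judgment must never turn into spending
    if verdict != "APPROVE":
        return None  # ambiguous, hedged, or malformed verdicts all land here
    return paid_search(query)


def stub_judge(task: str, query: str) -> str:
    # Stand-in for the model's decision: hedges instead of approving.
    return "probably fine?"


print(gated_search("summarize notes", "latest pricing", stub_judge, lambda q: "paid result"))
```

Only an exact approval reaches `paid_search`, which in a real harness would be the capped gateway and not the raw API, so the judgment and the enforcement stay separate layers.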

If the budget is a single shared number across parallel agents, you're holding a distributed-counter problem: two agents can both read "budget remaining," both decide they're clear, and both spend, blowing the cap. You have two honest options. The simpler is to serialize every paid call through the one budgeted agent: correct and easy to reason about, but that agent becomes a throughput bottleneck for the whole fleet. The other is an atomic reservation: each call reserves its slice of the quota *before* spending and releases the unused remainder after, which keeps agents parallel at the cost of a little more plumbing. For metered money, paying either price beats discovering the race condition on the invoice. Just don't pretend it isn't there.
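The reservation option can be sketched as follows, assuming all agents run as threads in one process so a `threading.Lock` is enough. Agents in separate processes or machines would need a shared store with an atomic decrement instead, which this sketch does not cover. `SharedBudget` and `run_agent` are hypothetical names:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedBudget:
    """Hypothetical shared quota, counted in paid-call units."""

    def __init__(self, total: int) -> None:
        self._remaining = total
        self._lock = threading.Lock()

    def reserve(self, amount: int) -> bool:
        # Check and deduct in one critical section, so two agents can
        # never both see the same units as available.
        with self._lock:
            if amount > self._remaining:
                return False
            self._remaining -= amount
            return True

    def release(self, amount: int) -> None:
        with self._lock:
            self._remaining += amount

    @property
    def remaining(self) -> int:
        with self._lock:
            return self._remaining


def run_agent(budget: SharedBudget, estimate: int, actual: int) -> bool:
    # Assumes actual <= estimate; the reservation is the agent's ceiling.
    if not budget.reserve(estimate):
        return False  # denied before any money moves
    # ... the paid calls would happen here, consuming `actual` units ...
    budget.release(estimate - actual)  # hand back the unused remainder
    return True


budget = SharedBudget(total=10)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda _: run_agent(budget, 3, 2), range(8)))

print(sum(results), "agents ran;", budget.remaining, "units left")
```

How many agents get through depends on thread timing, but the invariant does not: units spent plus units remaining always equals the original total, and the remainder never goes negative.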

Most "agent budget" advice optimizes the cap. The leverage is one level up: a gate that makes most work never spend at all, and a hard counter the model can't talk its way past. Stop asking your fleet to be frugal. Give exactly one agent a budget the harness enforces, and the judgment to know when not to use it.
