# How to Slash Your LLM Bill With a Multi-Agent Setup

> Source: <https://pub.towardsai.net/how-to-slash-your-llm-bill-with-a-multi-agent-setup-3c6c36f3ad47?source=rss----98111c9905da---4>
> Published: 2026-06-26 15:01:02+00:00

*Running every task through your most powerful AI model is like paying a senior architect to photocopy. The smarter setup is a team: one expensive model that thinks and plans, and a crew of cheap, fast models that do the actual work. The price gap between those tiers is not small. It is often ten to a hundred times per token, so the savings are real and large. Here is how the structure works, why the math is so lopsided, and how to build it without a single special tool or library.*

There’s a habit almost everyone falls into when they start building with AI. You find the best model available, the smartest, most capable one, and you route everything through it. Planning, hard reasoning, simple formatting, renaming variables, reading a file, writing boilerplate, all of it goes to the top-tier model, because if that model is the best, surely using it for everything gives the best result.

It does give a good result. It also gives you a bill several times larger than it needed to be, because most of what you asked it to do didn’t require anything close to its full ability. You used a senior architect to operate the photocopier, and then you paid the architect’s hourly rate for the copying.

There’s a better structure, and it’s borrowed from how real teams work. Instead of one expensive expert doing every job, you build a team. One powerful model acts as the brain: it takes the goal, thinks it through, and breaks it into pieces. Then it hands the pieces off to cheaper, faster models that do the actual labor. The brain plans, the workers execute, and you pay top-tier prices only for the small slice of work that genuinely needs top-tier intelligence. The result is the same finished job at a fraction of the cost, and you can build the whole thing with nothing but ordinary API calls. No framework, no library, no special tooling. Here’s how it works and why the savings are so large.

To see why this works, you have to look at what different models actually cost, because the gap is far bigger than most people realize.

AI models are priced per token, roughly per chunk of text in and out, and the price difference between the top tier and the bottom tier is enormous. A frontier model, the kind you would pick as your brain, runs in the neighborhood of 5 dollars per million tokens of input and 25 to 30 dollars per million tokens of output. A capable budget model, the kind you would use as a worker, can cost as little as 10 to 40 cents per million tokens, and some are cheaper still. That isn’t a small discount. Depending on which models you compare, the cheap one can be ten, twenty, even a hundred times less expensive per token than the expensive one. Across the whole market, the price of output tokens spans from under 30 cents to over 30 dollars, a range of more than a hundred to one.
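To make the gap concrete, here is a minimal sketch of the ratio arithmetic. The prices are placeholders picked from the ranges quoted above, and treating 10 cents as the budget input price and 40 cents as its output price is an assumption for illustration, not a quote for any specific model.

```python
# Illustrative USD prices per million tokens, picked from the ranges quoted
# above. Placeholders only, not live quotes for any particular model.
FRONTIER = {"input": 5.00, "output": 25.00}
BUDGET = {"input": 0.10, "output": 0.40}  # assumed input/output split


def price_ratio(expensive: dict, cheap: dict) -> dict:
    """How many times more the expensive tier costs, per token direction."""
    return {kind: expensive[kind] / cheap[kind] for kind in ("input", "output")}


ratios = price_ratio(FRONTIER, BUDGET)
for kind, ratio in ratios.items():
    print(f"{kind}: about {ratio:.0f}x more expensive")
```

Swap in the current prices for the models you are actually comparing; the ratio, not the absolute number, is what drives the savings.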

Now think about what that means for a real job. Suppose you use one frontier model to handle a thousand small tasks, most of which are simple: formatting text, extracting a field, making a routine edit. You’re paying the highest per-token rate in the market for work a model costing a fraction as much could do just as well. Route those simple tasks to a cheap worker model and keep only the genuinely hard reasoning on the expensive brain, and the bill does not drop by a little. It can shrink to a fraction of what it was. One widely cited example of this exact setup took a workload from about 10,500 dollars a month to roughly 1,500 dollars a month, an 86 percent cut, with almost no drop in quality because the expensive model was still doing all the hard parts. The cheap work simply stopped being billed at premium rates.

That’s the whole financial case in one line. The expensive model is worth its price for the hard 10 percent of the work and a waste of money on the easy 90 percent, and a team structure lets you pay the right price for each.
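The arithmetic behind that claim can be sketched in a few lines. The first calculation uses the monthly figures quoted above; the second is a hypothetical workload (1,000 million tokens, a 10/90 hard/easy split, and a single blended price per tier) chosen only to illustrate the shape of the savings.

```python
def percent_saved(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


def blended_cost(tokens_m: float, hard_share: float,
                 frontier_price: float, cheap_price: float) -> float:
    """Cost in USD when only `hard_share` of tokens hit the frontier model.

    Prices are USD per million tokens; `tokens_m` is millions of tokens.
    """
    hard = tokens_m * hard_share * frontier_price
    easy = tokens_m * (1 - hard_share) * cheap_price
    return hard + easy


# The monthly figures quoted above.
print(f"{percent_saved(10_500, 1_500):.1f}% saved")  # 85.7% saved

# Hypothetical workload: 1,000M tokens, 10% hard, $25 vs $0.40 per million.
all_frontier = blended_cost(1_000, 1.0, 25.0, 0.40)
team = blended_cost(1_000, 0.10, 25.0, 0.40)
print(f"all-frontier ${all_frontier:,.0f}, team ${team:,.0f}, "
      f"{percent_saved(all_frontier, team):.1f}% saved")
```

This ignores planning overhead and retries, so treat it as an upper bound on the savings rather than a forecast.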

The useful way to design this is to think in roles, exactly like staffing a real project, and you don’t have to stay within one company’s models. The best setups mix providers, using whichever model is cheapest for the capability each role needs.

The brain is your most capable model, and it does the least typing and the most thinking. Its job is to take the goal, reason about how to approach it, break it into clear sub-tasks, and decide which worker handles each. It’s also where you escalate anything a worker gets stuck on. Because the brain mostly plans rather than producing huge volumes of output, you use it sparingly, and its high per-token cost barely matters since it’s responsible for only a sliver of the total tokens. A current frontier model like Opus 4.8 or a top GPT or Gemini model is the right pick here.

The managers are your mid-tier models, the ones with solid reasoning that costs a fraction of frontier prices. They handle the work that needs real competence but not the absolute ceiling, drafting, moderate analysis, multi-step tasks that are not the hardest ones. A model like Claude Sonnet or Gemini Pro sits in this tier, capable enough to be trusted with substance, cheap enough to use freely.

The hustlers are your cheap, fast workers, and they do the bulk of the labor. Formatting, extraction, simple edits, classification, routine transformations, the high-volume mechanical work that makes up most of any real job. This is where a small, inexpensive model earns its keep, a Haiku-class model, a Gemini Flash tier, or a low-cost model like DeepSeek, any of which costs pennies per million tokens. You run these constantly and barely feel the cost, which is the entire point.

Match the model to the difficulty of the task, and mix across providers to get the cheapest capable option for each tier. The brain thinks, the managers handle the real-but-not-hardest work, and the hustlers grind through the volume. Each one is priced for its job.

There are two honest ways to put this team structure to work, and they suit different people. The first uses an agentic tool that already exists and needs almost no code. The second is the build-it-yourself route for full control. Both get you the same payoff, and both are laid out as steps below.

For most people, using an existing agentic tool is the easier path. Several of the popular agentic coding tools already let you assign a different model to each agent, so you just configure the roles and the tool handles the orchestration. OpenCode and Claude’s command-line tool are two clear examples, and the steps are nearly identical in spirit. Here is the setup using OpenCode, which has the advantage of letting you mix models from different providers in one team.

Step 1, install the tool and connect your models. Install OpenCode, then add the providers whose models you want to use, so you have a frontier model, a mid-tier model, and a cheap model available. Set a sensible default model in the config file, for example a mid-tier model, so anything unspecified runs at a reasonable price.

Step 2, create a planner agent on a strong model. Agents are just small markdown files you drop in an agents folder, with a few lines of settings at the top and a system prompt below. Make one for your planner, set its model to a frontier model, and tell it in the prompt to break a task into smaller sub-tasks and delegate rather than do everything itself. In practice the top of that file is a handful of lines, a description, the mode set to primary so it can orchestrate, and the model line pointing at your frontier model.

```
---
description: Plans work and delegates to worker agents
mode: primary
model: anthropic/claude-opus-4-8
---
You are the planner. Break the request into small sub-tasks,
decide which worker should handle each, and delegate. Do the
hard reasoning yourself, but hand routine work to the workers.
```

Step 3, create worker agents on cheaper models. Make one or more worker agents the same way, each its own markdown file, but set their model to a cheaper tier and their mode to subagent so the planner can invoke them. A bulk-work agent might point at a small, cheap model, while a mid-tier worker for moderately hard tasks points at a mid model. This is where the diversification happens, and you can point different workers at different providers to get the cheapest capable option for each.

```
---
description: Handles routine, high-volume tasks
mode: subagent
model: anthropic/claude-haiku-4-5
---
You are a worker. Do the specific task you are given quickly
and return the result. Keep it focused.
```

Step 4, let the planner delegate. Start a session with your planner as the active agent. When you give it a goal, it breaks the work down and hands sub-tasks to the worker agents, which run on their cheaper models, while the planner stays on the expensive one only for the planning and the hard parts. The tool manages the message-passing between them. Many of these tools also let the main agent pick a worker automatically based on each agent’s description, so once your agents are defined, the routing largely takes care of itself.

Step 5, tune which worker gets what. Watch where the cheap workers do well and where they struggle, and adjust: sharpen each agent’s description so the planner routes the right tasks to it, or move a task type up to a more capable worker if the cheap one keeps failing it. That tuning is the whole job once the agents exist.

That is the entire setup in a tool: a planner on a strong model, workers on cheap ones, each defined in a few lines, and the tool handling the orchestration between them. If one of these tools already fits your workflow, this is the fastest route by far.

Building it yourself is the right choice when you want full control over the routing, want to use models or providers a tool does not support, or are building this into your own software rather than a coding session. The reason to do it yourself isn’t that it’s easier; it’s that you decide exactly how tasks are split, routed, and escalated. Here’s that build, step by step.

Step 1, get API access across your three tiers. Sign up for the providers you want and get an API key for each, a frontier model for the brain, a mid-tier model for the managers, and a cheap model for the hustlers. They don’t have to be from the same company, and mixing providers to get the cheapest capable option per tier is the point, so pick whatever gives you the best price at each level.

Step 2, write one function per tier. Make three small functions, one that sends a prompt to your brain model, one to your mid-tier model, and one to your cheap worker. Each is just a basic API call that takes a prompt and returns the response. This is the whole toolkit, three functions that each talk to one model.
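A minimal sketch of those three functions might look like the following. To keep it runnable offline, the actual HTTP call is replaced by a stand-in `echo_transport`; `make_tier`, the transport signature, and the model identifiers are all hypothetical names introduced for illustration. In a real build, the transport would wrap whichever provider SDK or HTTP endpoint you use.

```python
from typing import Callable

# A transport takes (model_id, prompt) and returns the model's text reply.
Transport = Callable[[str, str], str]


def make_tier(model_id: str, send: Transport) -> Callable[[str], str]:
    """Bind one model to a transport, yielding a prompt -> response function."""
    def call(prompt: str) -> str:
        return send(model_id, prompt)
    return call


def echo_transport(model_id: str, prompt: str) -> str:
    """Stand-in for a real API call so this sketch runs without network access."""
    return f"[{model_id}] {prompt}"


# Hypothetical model identifiers; substitute whatever your providers expose.
ask_brain = make_tier("frontier-model", echo_transport)
ask_manager = make_tier("mid-tier-model", echo_transport)
ask_hustler = make_tier("cheap-model", echo_transport)

print(ask_hustler("Extract the invoice number from: INV-1042 due Friday"))
```

Keeping the transport separate from the tier functions means each tier can point at a different provider without the rest of the pipeline knowing or caring.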

Step 3, have the brain make a plan. Send your overall goal to the brain model with an instruction along the lines of "break this task into a list of smaller sub-tasks, and for each one label how hard it is: simple, medium, or hard." The brain returns a structured plan: a list of sub-tasks, each tagged with a difficulty. Apart from the sub-tasks labeled hard, this single call is the only place you lean on your expensive model, and it’s cheap because it’s one planning step, not the whole job.
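One way to make the planning step machine-readable is to ask for JSON and validate the labels before trusting them. The sketch below assumes the brain replies with a JSON list of objects with `task` and `difficulty` keys; that reply shape, the prompt wording, and the function names are assumptions for illustration, and real models may need stricter prompting or retries to return clean JSON.

```python
import json

PLAN_INSTRUCTIONS = (
    "Break this task into a list of smaller sub-tasks. "
    "For each one, label how hard it is: simple, medium, or hard. "
    'Reply with JSON only, shaped like [{"task": "...", "difficulty": "simple"}].'
)
VALID_LABELS = {"simple", "medium", "hard"}


def build_plan_prompt(goal: str) -> str:
    """Combine the fixed planning instructions with the user's goal."""
    return PLAN_INSTRUCTIONS + "\n\nTask: " + goal


def parse_plan(raw: str) -> list:
    """Parse the brain's reply; unknown labels fall back to 'hard' (the safe tier)."""
    plan = []
    for item in json.loads(raw):
        difficulty = str(item.get("difficulty", "")).strip().lower()
        if difficulty not in VALID_LABELS:
            difficulty = "hard"
        plan.append({"task": item["task"], "difficulty": difficulty})
    return plan


# A hand-written sample reply standing in for a real model response.
sample_reply = (
    '[{"task": "Rename variables", "difficulty": "Simple"},'
    ' {"task": "Design schema", "difficulty": "tricky"}]'
)
print(parse_plan(sample_reply))
```

Defaulting unrecognized labels to the most capable tier costs a little more but avoids silently sending an unclassified task to the weakest model.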

Step 4, write the routing rule. This is the heart of it, and it’s a few lines. For each sub-task in the plan, look at its difficulty label and send it to the matching function: simple goes to the cheap worker, medium goes to the mid-tier model, and hard goes back to the brain. That’s the entire routing logic: a rule that maps a difficulty label to one of your three functions.
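As a sketch, the routing rule can be a plain dictionary lookup. The three `ask_*` functions here are stubs standing in for real per-tier API calls, and the sub-task shape (a dict with `task` and `difficulty`) is an assumption carried over from the planning step.

```python
from typing import Callable, Dict


# Stand-ins for the three per-tier functions; real ones would call an API.
def ask_hustler(prompt: str) -> str:
    return f"cheap model handled: {prompt}"


def ask_manager(prompt: str) -> str:
    return f"mid-tier model handled: {prompt}"


def ask_brain(prompt: str) -> str:
    return f"frontier model handled: {prompt}"


# The whole routing rule: difficulty label -> function.
ROUTES: Dict[str, Callable[[str], str]] = {
    "simple": ask_hustler,
    "medium": ask_manager,
    "hard": ask_brain,
}


def route(subtask: dict) -> str:
    """Send one sub-task to the function matching its difficulty label."""
    handler = ROUTES.get(subtask["difficulty"], ask_brain)  # unknown label -> brain
    return handler(subtask["task"])


print(route({"task": "Lowercase the headings", "difficulty": "simple"}))
```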

Step 5, loop through the tasks and collect the results. Walk the list of sub-tasks, run each one through the routing rule so it goes to the right model, and gather the outputs as they come back. At the end you stitch the pieces together into the finished result. Most of these calls are hitting your cheap worker, which is exactly why the bill stays low.
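A minimal version of that loop, with a per-tier tally so you can see where the calls actually land, might look like this. The lambdas are stubs in place of real model calls, and the plan is a made-up example.

```python
from collections import Counter


def run_plan(plan, routes):
    """Run every sub-task through its tier; return outputs and a per-tier tally."""
    outputs, calls = [], Counter()
    for subtask in plan:
        tier = subtask["difficulty"]
        outputs.append(routes[tier](subtask["task"]))
        calls[tier] += 1
    return outputs, calls


def stitch(outputs):
    """Join the pieces in plan order; real jobs may need a smarter merge."""
    return "\n\n".join(outputs)


# Stub handlers standing in for the three per-tier API functions.
routes = {
    "simple": lambda task: f"(cheap) {task}",
    "medium": lambda task: f"(mid) {task}",
    "hard": lambda task: f"(frontier) {task}",
}
plan = [
    {"task": "Format the changelog", "difficulty": "simple"},
    {"task": "Extract version numbers", "difficulty": "simple"},
    {"task": "Draft release notes", "difficulty": "medium"},
    {"task": "Decide the migration strategy", "difficulty": "hard"},
]

outputs, calls = run_plan(plan, routes)
print(stitch(outputs))
print(dict(calls))
```

This version runs the sub-tasks sequentially and assumes they are independent; if one sub-task needs another's output, the loop would need to pass results forward.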

Step 6, add an escalation check. For each result, do a quick check: did the worker actually complete the task, or did it fail or return something obviously wrong? If a cheap model couldn’t handle a piece, send that one piece up to a more capable model and use that answer instead. This is what keeps quality high: the cheap models do the easy bulk, and anything they stumble on gets escalated.
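One way to sketch the escalation check is a "ladder" of models tried cheapest first. The acceptance test below (`looks_ok`) is deliberately crude and entirely an assumption; in practice the check depends on the task, for example validating JSON, running a test, or matching a schema.

```python
def looks_ok(result: str) -> bool:
    """Crude acceptance check: non-empty and not an obvious failure message."""
    text = result.strip().lower()
    return bool(text) and not text.startswith(("error", "i cannot", "i can't"))


def run_with_escalation(task, ladder, check=looks_ok):
    """Try each model in order, cheapest first.

    Returns (result, level), where level is the index of the model whose
    answer was kept. If nothing passes, the last answer is returned as-is.
    """
    result, level = "", -1
    for level, ask in enumerate(ladder):
        result = ask(task)
        if check(result):
            break
    return result, level


# Stubs: the cheap model fails this task, the mid-tier one succeeds.
def flaky_hustler(task: str) -> str:
    return ""


def steady_manager(task: str) -> str:
    return f"done: {task}"


def brain(task: str) -> str:
    return f"done carefully: {task}"


answer, level = run_with_escalation(
    "Normalize these dates", [flaky_hustler, steady_manager, brain]
)
print(answer, "| kept answer from ladder level", level)
```

Logging the returned level for each task gives you the data needed later to decide which task types should skip the cheap tier entirely.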

That’s the whole architecture, three model functions, a planning call, a routing rule, a loop, and an escalation check. Nothing else.

One simplification worth knowing: you don’t strictly need the brain to do the planning and labeling. For many jobs a crude rule works fine. Route by task length, by whether code is involved, or by a keyword or two, sending anything that looks simple to the cheap model and anything that looks hard to the expensive one. A middle option is to use a cheap model itself as the classifier: a fast, inexpensive call that just labels each task simple, medium, or hard before the real work begins. Either way, the routing stays a few lines, not a dependency.
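A crude rule of that kind might be sketched as follows. The keyword list, the code-detection pattern, and the 400-character cutoff are all arbitrary assumptions chosen for illustration; you would replace them with signals that match your own workload.

```python
import re

# Hypothetical hints that a task needs real reasoning.
HARD_HINTS = ("architecture", "trade-off", "root cause", "design")
# Rough signs that code is involved: a fence, or a function/class definition.
CODE_PATTERN = re.compile(r"```|\bdef \w+\(|\bclass \w+\b|\bfunction \w+\(")


def classify(task: str, long_cutoff: int = 400) -> str:
    """Label a task simple, medium, or hard using keywords, code, and length."""
    lowered = task.lower()
    if any(hint in lowered for hint in HARD_HINTS):
        return "hard"
    if CODE_PATTERN.search(task) or len(task) > long_cutoff:
        return "medium"
    return "simple"


for example in (
    "Lowercase this title: HELLO WORLD",
    "def add(a, b):\n    return a + b\nAdd type hints to this function.",
    "Find the root cause of the cache memory leak.",
):
    print(classify(example), "<-", example.splitlines()[0])
```

Substring matching is blunt (the hint "design" also matches "redesign"), which is acceptable for a first pass as long as an escalation check catches the misroutes.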

The only real tuning is deciding where your cutoffs are, which tasks count as simple enough for a hustler and which need to go up the chain, and you dial that in by watching where the cheap models succeed and where they fall down.
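To dial in the cutoffs from evidence rather than guesswork, you can tally how often the cheap tier fails per task type. The class below is a hypothetical sketch; the 20 percent threshold and the 20-sample minimum are arbitrary starting points, not recommendations.

```python
from collections import defaultdict


class WorkerStats:
    """Track cheap-tier outcomes per task type to decide what to move up a tier."""

    def __init__(self):
        self._attempts = defaultdict(int)
        self._failures = defaultdict(int)

    def record(self, task_type: str, ok: bool) -> None:
        self._attempts[task_type] += 1
        if not ok:
            self._failures[task_type] += 1

    def failure_rate(self, task_type: str) -> float:
        attempts = self._attempts.get(task_type, 0)
        return self._failures.get(task_type, 0) / attempts if attempts else 0.0

    def should_promote(self, task_type: str, threshold: float = 0.2,
                       min_samples: int = 20) -> bool:
        """True once there is enough data and the cheap tier fails too often."""
        enough = self._attempts.get(task_type, 0) >= min_samples
        return enough and self.failure_rate(task_type) > threshold


stats = WorkerStats()
for i in range(20):
    stats.record("extraction", ok=(i != 0))   # 1 failure in 20
    stats.record("sql-rewrite", ok=(i >= 8))  # 8 failures in 20

for task_type in ("extraction", "sql-rewrite"):
    print(task_type, f"{stats.failure_rate(task_type):.0%}",
          "promote" if stats.should_promote(task_type) else "keep on cheap tier")
```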

This is a genuinely better setup for most real work, but it isn’t free, and pretending otherwise would be selling you something.

Coordination costs tokens. The brain spends tokens planning and splitting the work, and that overhead is real, though it’s usually small next to what you save by keeping the bulk of the labor cheap. Cheap models fail more often on anything tricky, which means sometimes a hustler botches a task, you detect it, and you pay again to have a better model redo it, so your savings are a little smaller than the raw price gap suggests. There’s added complexity too: more moving parts, more places for something to go wrong, and more effort to debug a pipeline than a single call. And a subtle one worth knowing: in long multi-step jobs the biggest hidden cost is often the accumulated context being re-sent on every call, which a team structure helps with but does not entirely solve.
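The context point is easy to underestimate, so here is a rough sketch of the arithmetic. It compares a single conversation that re-sends its whole history on every call with an idealized team where each worker sees only its own sub-task. The step count and tokens per step are made-up numbers, and real pipelines fall somewhere between the two extremes because workers usually need some shared context.

```python
def resent_context_tokens(steps: int, tokens_per_step: int) -> int:
    """Input tokens when call k re-sends everything from steps 1..k."""
    return sum(tokens_per_step * k for k in range(1, steps + 1))


def isolated_tokens(steps: int, tokens_per_step: int) -> int:
    """Input tokens when each call carries only its own step's content."""
    return steps * tokens_per_step


steps, per_step = 20, 1_000
print(resent_context_tokens(steps, per_step))  # 210000
print(isolated_tokens(steps, per_step))        # 20000
```

The re-sent total grows with the square of the number of steps, which is why long single-model sessions get expensive even when each individual step is small.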

None of that changes the conclusion. Even after you account for the coordination overhead, the occasional retry, and the extra complexity, the team structure comes out well ahead on cost for any job with a meaningful amount of routine work in it, which is almost all of them. The savings from not paying frontier prices for grunt work are simply much larger than the costs of organizing the team. The one case where it isn’t worth it is a small, one-off task, where a single capable model is simpler and the savings would be trivial. For anything high-volume or repeated, the math favors the team, and it favors it heavily.
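A back-of-the-envelope model shows both halves of that conclusion. Every number below is a hypothetical input (a single blended price per tier, a 10 percent retry rate, 2,000 tokens per task), so read it as a way to plug in your own figures rather than as a measurement.

```python
def solo_cost(n_tasks: int, tokens_per_task_m: float, frontier_price: float) -> float:
    """USD cost when the frontier model does every task."""
    return n_tasks * tokens_per_task_m * frontier_price


def team_cost(n_tasks: int, tokens_per_task_m: float, cheap_price: float,
              frontier_price: float, retry_rate: float,
              planning_tokens_m: float) -> float:
    """USD cost with planning overhead plus frontier retries for failed tasks.

    Prices are USD per million tokens; token amounts are in millions.
    """
    planning = planning_tokens_m * frontier_price
    first_pass = n_tasks * tokens_per_task_m * cheap_price
    retries = n_tasks * retry_rate * tokens_per_task_m * frontier_price
    return planning + first_pass + retries


# High-volume job: 1,000 tasks of ~2,000 tokens, 50k tokens of planning.
bulk_solo = solo_cost(1_000, 0.002, 25.0)
bulk_team = team_cost(1_000, 0.002, 0.40, 25.0, 0.10, 0.05)

# One-off job: a single task, with 5k tokens of planning overhead.
single_solo = solo_cost(1, 0.002, 25.0)
single_team = team_cost(1, 0.002, 0.40, 25.0, 0.10, 0.005)

print(f"bulk:   solo ${bulk_solo:.2f} vs team ${bulk_team:.2f}")
print(f"single: solo ${single_solo:.4f} vs team ${single_team:.4f}")
```

With these inputs the team wins clearly on the bulk job and loses on the single task, because the planning overhead is fixed while the savings scale with volume.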

The real change here isn’t a trick, it’s a way of seeing the problem. Most people optimize by asking which model is best and then using it for everything. The team approach asks a sharper question for every piece of work, which is the cheapest model that can do this particular task well. That question, asked task by task, is what turns a giant bill into a small one.

The expensive model isn’t the enemy. It’s genuinely the best at the hard parts, and you should absolutely use it for them. The mistake is letting it do the easy parts too, at the same premium price. Hand the thinking to your most capable model, hand the labor to the cheap ones, mix providers to get the best price at every tier, and let each model do the job it is actually priced for. You’ll get the same work done for a fraction of the cost, and once you have seen the structure, using one model for everything will look like exactly what it is, paying a genius to do grunt work.

*If you build a tiered setup like this, drop a comment with the models you used for each role and the mix of providers you landed on. The combinations people actually run, and where they drew the line between cheap and expensive, are the most useful thing for the next person trying to cut their bill.*

[How to Slash Your LLM Bill With a Multi-Agent Setup](https://pub.towardsai.net/how-to-slash-your-llm-bill-with-a-multi-agent-setup-3c6c36f3ad47) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.
