Most companies today have one AI setup: send everything to the most powerful model available. Pay the bill. Repeat.
It works. But it's expensive, slower than it needs to be, and honestly β a bit like hiring a surgeon to change a lightbulb.
Imagine a hospital where every patient β whether they need open-heart surgery or a bandage on a paper cut β is seen by the senior consultant first.
The consultant is brilliant. But the waiting room is chaos. The costs are sky-high. And half his time is spent on things a nurse could have handled in two minutes.
That's what most AI pipelines look like today.
When your team sends something to an AI model, it might be a Python file, a customer complaint in Hindi, a SQL query, or a casual Hinglish support ticket. These are completely different problems requiring different expertise, different depth, different cost.
Yet most systems send them all to the same model, at the same price, with the same wait time.
Some inputs have hard, deterministic boundaries. A .py
file contains Python. A .sql
file contains SQL. You don't need the most powerful AI in the world to figure that out β you need a rule.
Here's what a smarter pipeline looks like:
Input arrives
β
Orchestrator SLM β a small, fast model that reads
the input and decides: what is this, who handles it?
β
βββ Python file β Python specialist model
βββ SQL query β SQL specialist model
βββ Hindi doc β Hindi specialist model
βββ Ambiguous β Frontier model directly
β
Specialist outputs + original input
β Frontier model β Final answer
There is no separate routing system to build. The orchestrator is itself a small AI model β trained to classify inputs and direct traffic. It costs almost nothing to run.
The powerful frontier model β your Claude, your GPT-4 β stays in the loop for the final answer. It just isn't doing the sorting anymore.
When specialist models pass findings to the frontier model, the instinct is to format outputs for human readability. Paragraphs. Explanations. Full sentences.
Wrong target.
The downstream consumer is another model β not a human. Specialist models should produce machine-readable structured output. Dense. Precise. No explanation.
{
"language": "python",
"issues_detected": ["unbounded loop at line 47"],
"confidence": 0.94
}
This isn't for a person to read. It's for a model to consume efficiently. The constraint is baked into training β not imposed by external truncation at runtime. Prevention over correction.
Think of it like a doctor handing a consultant a structured chart instead of a five-page narrative. Same information. Faster to read. More room to think.
The frontier model has finite working memory. Multiple specialists contributing outputs fills it fast. Here's the fallback stack, in order:
First β specialists send only essential structured signal. No reasoning traces. This is the default.
If needed β summarise the original input first. Compress before routing.
If still needed β feed specialist outputs one at a time. The frontier model builds context incrementally. Slower, but accurate.
Last resort β skip specialists entirely. Raw input directly to the frontier model. Full cost, guaranteed quality.
The pipeline always has a path to the right answer. You're just choosing how much it costs.
Honest answer: only at scale.
Specialist models are open-source β free to use, but you pay for compute. A reasonable GPU setup costs $1,000β1,100 per month. The savings come from routing a large share of queries away from expensive frontier API calls.
| Monthly AI API spend | Does this make sense? |
|---|---|
| Below $2,000 | Probably not β keep it simple |
| $3,000β$5,000 | Worth evaluating |
| Above $5,000 | Very likely yes |
One important caveat. If your team currently uses Claude.ai, Claude Code, or any managed AI interface β this architecture means moving away from that. You'd be calling APIs directly from your own system, which means building and owning the interaction layer your employees use.
| How you use AI today | What this means |
|---|---|
| Managed interface (Claude.ai, etc.) | Build a custom interface first β factor in engineering cost |
| Already using APIs with custom tooling | Plugs in naturally |
If you've used agent mode in Cursor β the AI coding tool β you've experienced this exact pattern without realising it.
Cursor doesn't send your entire codebase to one model and hope for the best. A lightweight orchestrator reads your request, decides what to do β read a file, search the codebase, run a terminal command β routes to the right tool, then a frontier model synthesises the final response.
Enterprise tools like Atlassian's Rovo are moving in the same direction for workplace workflows.
The companies that built these tools figured out that one model doing everything is wasteful. The question is whether the AI pipelines inside your organisation are designed with the same intelligence β or still sending every query to the most expensive model available.
Most AI cost and speed problems aren't model problems. They're routing problems.
The best AI pipelines look less like "one genius doing everything" and more like a well-run team: a smart receptionist, skilled specialists, and senior judgment applied only where it genuinely matters.
The question isn't which model is best.
It's: are you using the right model for the right job?
What routing decisions is your organisation making β or avoiding? Would love to hear in the comments.
Views expressed are my own and do not represent my employer.