A better way to manage LLM spending

Model routing is emerging as a new abstraction layer to manage LLM spending, allowing prompts to be directed to the most cost-effective model. Companies like Coinbase report halving AI spend while increasing token usage. This approach mirrors historical software abstractions, from assembly to compilers to LLMs.

As an old Delphi https://en.wikipedia.org/wiki/Delphi_(software) guy, I remember well the “language wars” we had with the Visual Basic https://en.wikipedia.org/wiki/Visual_Basic_(classic) guys. An early codename for Delphi was “VBK” (VB Killer), and the VB community took exception. They’d come to our Delphi forums and pick fights. Naturally, we brash Delphi guys would fight back, engaging in big flame wars and getting all worked up over what wasn’t much more than a personal preference. Good times.

These days, we’ve moved the discussion up a layer: which model is better for coding? Things aren’t quite as intense as the VB/Delphi dustups, but people have their opinions. Companies evaluate different models before choosing one for their teams, and most teams have settled on a family of models that they use.

At some point, chatting with Claude or Codex started to seem a bit raw. It wasn’t long before scaffolding tools like GStack https://github.com/garrytan/gstack and Superpowers https://github.com/obra/Superpowers were adding underpinnings for interacting with LLMs: baseline instructions that handle prompts before they get to the model itself. They help establish useful context and act as a layer above “raw prompting”. Context engineering https://www.infoworld.com/article/4127462/what-is-context-engineering-and-why-its-the-new-ai-architecture.html is the first and most common layer to add on top of the chat interface.

And then, once the choice of models and harnesses was made, everyone went crazy with tokenmaxxing https://www.infoworld.com/article/4170173/tokenmaxxing-is-super-dumb.html. If you have a model, of course you want to get the most out of it. But when the bill came in, managers were not pleased. As costs skyrocketed, leadership worried that the money wasn’t being well spent.

Just as assembly language and hand-tuning registers gave way to compilers and structured languages, which led to frameworks and libraries, and most recently to LLMs and prompting, it is starting to occur to developers and managers that there is a better way to manage LLM spending. But naturally, the minute you figure out how things work, another layer appears, making all your hard-earned knowledge outdated. Apparently being able to code in English https://www.infoworld.com/article/4018953/the-ultimate-software-engineering-abstraction.html isn’t enough to stop the next abstraction from appearing. So, as is always the case, another layer of abstraction has come along https://medium.com/nickonsoftware/what-is-the-next-layer-bdc0280723a8. Sic semper fuit.

Thus model routing is the latest way to maximize the value of each dollar spent on tokens. The idea is that not all prompts are created equal. Not everything that you ask Claude is going to require the deep thinking of a frontier model. A model router can look at the prompt, decide which model is best suited to answer it, and direct the query to that model. Maybe simpler requests are better suited to an older model. Maybe code reviews are better done with a model specifically designed for that purpose.

Model routing leads to more efficient token spending. When you run Claude Code today, you have to choose a model for the whole session, and if you want to use the top-tier model, you have to pay for it no matter what you end up doing. A model router lets you vary the model, and thus the cost. Organizations like Coinbase https://x.com/brian_armstrong/status/2070670644577280109?s=20 report cutting their AI spend in half while their token usage increases. LLMs are constantly evolving, becoming both more powerful and more specialized. Being able to route a prompt to the model that is both well-suited for the task and cost-effective is the way to maximize token effectiveness.

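
To make the routing idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the model names, the prices per million tokens, the keyword lists, and the 200-word threshold are all hypothetical, and a real router would more likely use a classifier or an LLM than a handful of substring checks. It only shows the shape of the decision: look at the prompt, pick a tier, and pay that tier's price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str                      # hypothetical model name
    usd_per_million_tokens: float  # hypothetical price, not a real quote


# Three made-up tiers: cheap general model, review specialist, frontier model.
CHEAP = Model("small-model", 1.0)
REVIEW = Model("code-review-model", 3.0)
FRONTIER = Model("frontier-model", 15.0)

# Crude keyword heuristics standing in for a real prompt classifier.
REVIEW_HINTS = ("review", "diff", "pull request")
HARD_HINTS = ("architecture", "design", "refactor", "debug")
LONG_PROMPT_WORDS = 200  # assumed cutoff for "this probably needs deep thinking"


def route(prompt: str) -> Model:
    """Pick a model tier for a prompt using simple substring checks."""
    text = prompt.lower()
    if any(hint in text for hint in REVIEW_HINTS):
        return REVIEW
    if any(hint in text for hint in HARD_HINTS) or len(text.split()) > LONG_PROMPT_WORDS:
        return FRONTIER
    return CHEAP


def cost_usd(model: Model, tokens: int) -> float:
    """Cost of sending `tokens` tokens to `model` at its flat per-token price."""
    return model.usd_per_million_tokens * tokens / 1_000_000


if __name__ == "__main__":
    # (prompt, token count) pairs; the token counts are invented for the example.
    workload = [
        ("Rename this variable to snake_case", 2_000),
        ("Review this diff for bugs", 10_000),
        ("Design the architecture for a billing service", 40_000),
    ]
    routed = sum(cost_usd(route(prompt), tokens) for prompt, tokens in workload)
    all_frontier = sum(cost_usd(FRONTIER, tokens) for _, tokens in workload)
    for prompt, _ in workload:
        print(f"{route(prompt).name:>18}  <-  {prompt}")
    print(f"routed: ${routed:.3f}   everything on frontier: ${all_frontier:.3f}")
```

The comparison at the bottom is the whole point: the same workload costs less when only the prompts that need the frontier model are sent to it. How much less depends entirely on the mix of prompts and on real prices, which this sketch does not try to model.
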
Teams are doing this manually now, but AI itself will become the best way to make such decisions. For example, Claude Code Router https://github.com/musistudio/claude-code-router can route prompts to any number of popular models, depending on the type of work each prompt requires. And it’s open source.

The next layer that is coming is the preprocessing of prompts. We can work to write good prompts, but AI itself can improve upon what we ask. One of the best techniques in prompting is to tell the LLM to “ask the questions that I’m not asking but should be asking”. I can easily imagine a world in which you write a prompt, AI helps you clarify and improve it, and then routes it to the best, most cost-effective model for an answer. You won’t be choosing a given LLM provider anymore. Instead, you can focus on specifying exactly what you want.

So stop hand-crafting your prompts for a specific model. Let the coming model routers and prompt preprocessors do the hard work for you.