# Choose Wisely: Models Should Follow Your Use Case.

> Source: <https://pub.towardsai.net/choose-wisely-models-should-follow-your-use-case-9e1c420fbbf6?source=rss----98111c9905da---4>
> Published: 2026-06-25 07:42:09+00:00

A guy in my builders’ Discord group blew his entire Codex subscription in eleven days. Two weeks into the month, nothing left. You know what he was building? A billing feature in his SaaS. Not a compiler. Not an operating system kernel. Not a real-time physics simulation. A billing page with subscriptions, invoices, and a [Dodo Payments](https://app.dodopayments.com/partners/FZYjxGGVjh/signup) webhook that doesn’t send duplicate emails.

He said it with the exhausted pride of someone who just pushed to prod at 2 AM (we devs are batmans, right?). I nodded. I didn’t say anything. But inside I was doing the mental math.

I run my full AI stack (coding agent, agent workflows, browser automation, speech to text) for around $10–15 a month. And I ship. Regularly (my [GitHub](https://github.com/dhanushk-offl/) is proof of that). With billing features and everything.

That conversation is what this post is about.

Let me describe a pattern you’ve probably noticed.

A big AI company/lab drops a new model/version. The announcement lands. Within hours, everyone on X is posting about it. *“Our model built a C compiler from scratch.” “Our model achieved gold on the International Math Olympiad.” “Our model solved problems that researchers said required human-level reasoning.”*

The posts get thousands of likes. Engineers screenshot the benchmark charts. Someone puts together a thread comparing it to the previous generation. Replies flood in from founders saying they’re switching immediately.

Then someone from Chennai quietly tries it on their actual codebase and reports back that it’s roughly the same as before for their use case. This tweet gets eleven likes.

I’m not mocking the benchmark results. Building a C compiler is impressive. Scoring on the IMO is legitimately hard. These results tell you something real about what the model is capable of in controlled settings.

But here is the question nobody asks loudly enough: when was the last time your actual work required an AI to build a C compiler?

Look at what you built last week. Probably a REST endpoint. A React component that talks to it. Some data validation logic. An email template. A webhook handler. A cron job that moves rows between two database tables. Maybe a RAG pipeline if you’re in the AI space. Something with auth. Something with payments.

You are not building compiler infrastructure. You are building software for users. Web apps. Mobile apps. Developer tools. Internal automation. The kind of work that, individually, each piece looks boring on a benchmark slide but collectively represents most of the software being written on earth today.

The benchmark score tells you the ceiling of what a model can achieve on curated academic tasks. It does not tell you whether the model is the right tool for your Monday morning standup’s ticket queue.

I learned this slowly. And expensively.

Before I get into the specific models, I need to clear up something that trips up engineers constantly. When someone says a model is “open source,” they usually mean one of two very different things, and conflating them leads to bad decisions.

The first is open weights. The actual model parameters, the billions of floating point numbers that encode what the model knows, are publicly available. You can download them. You can run them on your own hardware. You can fine-tune them on your own data. You can deploy them inside your own VPC and never send a single token to anyone else’s server. You can modify the architecture and release derivatives. Models like GLM-5.2, DeepSeek V4, Kimi K2.6, and Nemotron from NVIDIA are all open-weight models. The weights live on Hugging Face. Most of them ship under MIT licenses, which means you can use them commercially without paying anyone a licensing fee.

The second is what most of the subscription-based coding tools are: API access. You get to call their endpoint. The model runs on their servers. Their data retention policy applies to your prompts. Their pricing can change next quarter. If their infrastructure has issues on the day you have a demo, that is your problem too. You never see the weights. You cannot run it locally. The model is theirs; you are renting access.

The practical difference matters more than most engineers realize until they’ve felt it.

With open weights, your inference cost is literally your compute. You can run through OpenRouter or Together AI and pay per token with no monthly subscription, switching to a better model the day it ships. You can cache aggressively. You can self-host if the data sensitivity requires it. You are not locked into anyone’s pricing model.

There is also a comfortable middle path, which is what I run: open-weight models accessed through inference providers. Pay per token, no subscription, full flexibility to switch, and the per-token cost is typically a fraction of what the closed model APIs charge.

I’ve read too many “why I use open source models” posts that are basically just “open source good, closed source bad” with a Hugging Face link at the bottom. Useless. Let me be specific.

When GLM-5.2 dropped from Z.ai, the Beijing-based lab that used to be called Zhipu AI, the X (Twitter) reaction was something. [Aravind Srinivas](https://x.com/AravSrinivas/status/2069146151325257913?s=20) posted about it. [Guillermo Rauch](https://x.com/rauchg/status/2068517095818809770?s=20) appreciated it. The Artificial Analysis Intelligence Index ranked it at 51 points, which put it above DeepSeek V4 Pro, Kimi K2.6, and even some Google models. On their GDPval-AA v2 metric, which is their best approximation of real agentic task performance, GLM-5.2 roughly matched GPT-5.5.

But you know how it goes. X (Twitter) energy is its own genre. I do not make infra decisions based on who gets quote-tweeted by whom.

So I used it. Daily, on a $10/month OpenCode Go plan. I built the billing feature with it: subscriptions, metered usage, invoice generation, and [Dodo](https://app.dodopayments.com/partners/FZYjxGGVjh/signup) webhook handling with idempotency keys so the emails don’t duplicate. The model handled all of it without me holding its hand through every function. I was not babysitting it. I was shipping.
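For readers who haven't met idempotency keys before, here is a minimal sketch of the idea, assuming each webhook event carries a unique `id` field. The event shape, `handle_webhook`, and the in-memory set are hypothetical stand-ins, not the Dodo Payments API; a real handler would keep processed IDs in a database table with a unique constraint.

```python
# Hypothetical sketch: deduplicate webhook deliveries by event ID.
# Event shape and function names are assumptions, not a real provider's API.

processed_event_ids: set[str] = set()  # in production: a DB table with a unique key
sent_emails: list[str] = []            # stand-in for a real email service


def send_invoice_email(customer_email: str) -> None:
    sent_emails.append(customer_email)


def handle_webhook(event: dict) -> str:
    """Process an event once, no matter how many times it is delivered."""
    event_id = event["id"]
    if event_id in processed_event_ids:
        return "duplicate_ignored"
    processed_event_ids.add(event_id)
    send_invoice_email(event["customer_email"])
    return "processed"


event = {"id": "evt_123", "customer_email": "user@example.com"}
first = handle_webhook(event)
second = handle_webhook(event)  # simulated retry of the same delivery
print(first, second, len(sent_emails))
```

The retry is acknowledged but does nothing, so the customer gets exactly one email.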

Technically, GLM-5.2 is a Mixture-of-Experts model with 753 billion total parameters. The “Mixture-of-Experts” part is important enough that I’ll explain it properly in a section below. The context window is one million tokens, which sounds like a spec-sheet number until you actually try to feed it your entire backend codebase and watch it reason across files you thought it would lose track of.

The interesting architectural detail is something Z.ai calls IndexShare. Here’s the problem it solves: at one million tokens, standard transformer attention is computationally brutal. The cost grows quadratically with context length, so a 1M token context isn’t just ten times more expensive than a 100K context, it’s more like a hundred times more expensive. IndexShare gets around this by reusing the same lightweight indexer across every four consecutive sparse attention layers instead of computing a new one for each. At 1M tokens, this cuts the per-token FLOPs by a factor of 2.9. That is not a minor tweak. That is what separates “supports 1M context” on a benchmark slide from “can actually use 1M context in production without your inference costs going vertical.”
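The arithmetic behind those two numbers fits in a few lines. This sketch only assumes that attention cost scales with the square of the context length; the function name is mine, and nothing here models IndexShare itself.

```python
def relative_attention_cost(context_tokens: int, baseline_tokens: int) -> float:
    """Cost ratio if attention cost scales with the square of context length."""
    return (context_tokens / baseline_tokens) ** 2


# 10x more context means roughly 100x more attention compute
ratio = relative_attention_cost(1_000_000, 100_000)

# A 2.9x reduction leaves about a third of the original per-token FLOPs
remaining_fraction = 1 / 2.9

print(ratio, round(remaining_fraction, 2))
```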

The thing that sold me was not the benchmark. It was the day I ran it against a payment service codebase I’d inherited from a previous project: a thing with four different [Dodo](https://app.dodopayments.com/partners/FZYjxGGVjh/signup) event handlers, some legacy subscription logic, and a webhook processor that had comments like “TODO: figure out why this sometimes fires twice” dating back to 2021. GLM-5.2 read the whole thing, understood the context, and helped me fix the duplicate-fire issue without me having to summarize what each file did. That was the moment.

For Hermes, my internal automation system that handles repetitive background tasks and orchestration workflows, I’ve been using Kimi models from Moonshot AI.

The Kimi series has moved fast. K2 in July 2025. K2.5 in January 2026. K2.6 in April 2026. Each release meaningfully closed the gap with closed frontier models. K2.6 is where I landed.

It’s a one-trillion-parameter MoE model, but only 32 billion parameters are active per token, which means inference cost is roughly that of a 32B model, not a trillion-parameter model. On SWE-Bench Pro, it ties GPT-5.5 at 58.6%. API pricing is around $0.95 per million input tokens and $4.00 per million output. GPT-5.5 is considerably more expensive. The math is not subtle.
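To see how little that is, here is a rough cost calculator using the two prices quoted above. The workload (20 agent runs a month, each with 50K tokens in and 5K tokens out) is a made-up example, not my real usage.

```python
INPUT_PRICE_PER_M = 0.95   # USD per million input tokens (figure quoted above)
OUTPUT_PRICE_PER_M = 4.00  # USD per million output tokens (figure quoted above)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the per-million-token prices above."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Hypothetical workload: 20 agent runs, each 50K tokens in and 5K tokens out
monthly = 20 * request_cost(50_000, 5_000)
print(f"${monthly:.2f}")  # $1.35
```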

What matters for agent use cases is coherence across long chains of tool calls. A lot of cheaper models sound fine in isolation but start going sideways somewhere around tool call fifteen in a chain of fifty. They lose the thread. They start contradicting earlier steps. They confidently do the wrong thing.

Kimi K2.6 has what Moonshot calls Agent Swarm, a native architecture for multi-agent coordination. The model can decompose a complex task into parallel sub-agent workstreams and coordinate the outputs. In practice, for my Hermes workflows, this means I can set up long-running automation tasks, leave them running, and come back to coherent results rather than having to babysit the run like a new intern’s first week.

MiMo models from Xiaomi handle lighter tasks and shorter context requirements in the same system. Think of it as routing: not every task needs the full K2.6 capacity, so lighter tasks go to lighter models and costs stay proportional to complexity.
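The routing idea can be as simple as one function. This is a sketch, not my actual Hermes code: the model labels, the 32K threshold, and the `needs_tool_chain` flag are all placeholders you would tune for your own workload.

```python
# Hypothetical router: model labels and the threshold are placeholders.
LIGHT_MODEL = "light-model"
HEAVY_MODEL = "heavy-model"
CONTEXT_THRESHOLD = 32_000  # assumed cutoff, in tokens


def pick_model(estimated_context_tokens: int, needs_tool_chain: bool) -> str:
    """Send long-context or multi-step tool work to the heavy model, the rest to the light one."""
    if needs_tool_chain or estimated_context_tokens > CONTEXT_THRESHOLD:
        return HEAVY_MODEL
    return LIGHT_MODEL


print(pick_model(2_000, needs_tool_chain=False))    # short summarization job
print(pick_model(2_000, needs_tool_chain=True))     # long tool-call chain
print(pick_model(150_000, needs_tool_chain=False))  # big context
```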

This is where my setup gets a bit unusual and I want to explain the reasoning.

[Browser OS](https://github.com/browseros-ai/BrowserOS) automation (interacting with web UIs, extracting structured data, triggering workflows across tools) benefits from a model that can reason across long sessions while staying fast enough that the loop doesn’t feel dead. You want the model to remember what it did two hundred steps ago, but you also need tokens per second to be high enough that the automation completes before you’ve had time to make and finish your second cup of coffee.

DeepSeek V4 handles the reasoning and planning layer. The V3 and V4 lineage matters here: DeepSeek demonstrated that you can train frontier-quality models for around six million dollars, compared to the hundreds of millions that Western labs were spending on comparable generations. This was not just a cost story; it was a signal that the training efficiency techniques they developed were genuinely novel. V4 uses Compressed Sparse Attention, which compresses token sequences into summary representations and selectively attends via top-k selection. This is what makes the one-million-token context viable for long automation sessions where the agent genuinely needs to remember state from hours ago.
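If "compress, then pick the top-k" sounds abstract, here is a toy version of the general idea in plain Python. It is emphatically not DeepSeek's implementation: using a block mean as the summary, the block size, and every function name are my own assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def select_blocks(query, keys, block_size, top_k):
    """Summarize each block of keys by its mean, then keep the top_k best-matching blocks."""
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    scores = [dot(query, mean_vector(block)) for block in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def sparse_attention_weights(query, keys, block_size, top_k):
    """Softmax over only the tokens inside the selected blocks; all other tokens are skipped."""
    chosen = select_blocks(query, keys, block_size, top_k)
    token_ids = [i for b in chosen
                 for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    logits = [dot(query, keys[i]) for i in token_ids]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return {i: e / total for i, e in zip(token_ids, exps)}


keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, 0.0]]
query = [1.0, 0.0]
weights = sparse_attention_weights(query, keys, block_size=2, top_k=1)
print(sorted(weights))  # indices of the tokens that received any attention
```

Only the tokens in the winning block are scored in full, which is where the savings come from when the context is long.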

Nemotron from NVIDIA is architecturally the most interesting thing I use. The Nemotron 3 family — Nano, Super, Ultra — is built on a hybrid [Mamba-Transformer](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/) Mixture-of-Experts architecture. The key decision: replace most self-attention layers with [Mamba-2 layers](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/). Standard transformer attention has a KV cache that grows linearly with context. That means memory pressure climbs continuously as the context gets longer. Mamba-2 layers maintain a constant-size state instead of a growing cache, so the memory footprint stays flat regardless of how long the context is. For automation workloads with very long running sessions, this is not a theoretical advantage. You actually feel it.
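A back-of-the-envelope comparison makes the difference concrete. Every dimension below (layer count, heads, head size, state size, 2-byte values) is a hypothetical number picked for illustration, not Nemotron's actual configuration.

```python
def kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Keys and values are stored for every token in every attention layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def recurrent_state_bytes(tokens, layers=48, state_values_per_layer=4_000_000, bytes_per_value=2):
    """A fixed-size state per layer; `tokens` is accepted but deliberately unused."""
    return layers * state_values_per_layer * bytes_per_value


for tokens in (10_000, 100_000, 1_000_000):
    kv_gb = kv_cache_bytes(tokens) / 1e9
    state_gb = recurrent_state_bytes(tokens) / 1e9
    print(f"{tokens:>9} tokens: KV cache {kv_gb:.1f} GB, fixed state {state_gb:.3f} GB")
```

The first column climbs in step with the token count while the second never moves, which is the whole point of the hybrid design.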

NVIDIA released Nemotron with open weights, the full training data, and the training recipes. Not just the weights. The recipes. Super has 120B total parameters with 12B active. Ultra goes to 550B total with 55B active. And NVIDIA built and evaluated it explicitly with developer harnesses in mind (OpenCode, OpenHands, coding review loops), not just as a chat interface. That shows up in how reliably it handles multi-step tool use without losing track of the task structure.

I’ll keep this one short because the experience says it better than any spec.

I was using WhisperFlow for voice input. Perfectly fine. Then a Chennai guy (and I say this with full affection, because if you know, you know) mentioned Parakeet at a tea kadai (cafe) discussion that somehow turned into a thirty-minute model comparison session. I tried it the next morning with handy.computer (one of my favorite tools).

Parakeet TDT 0.6B v2 is 600 million parameters, trained on 64,000 hours of diverse audio, ranked first on the Hugging Face Open ASR leaderboard, with a 6.05% word error rate and inference running 50 times faster than comparable models. The architecture is a FastConformer with a TDT decoder, and it handles up to 24 minutes of audio in a single pass. It actually lives up to its benchmark hype.

But what the numbers don’t tell you: it handles Indian-accented English well. My Tamil-inflected English, the way I say “idempotent” (which is apparently not how Americans say it), function names like handleDodoWebhookRetry spoken aloud mid-thought: it transcribes all of this without losing the plot. I was skeptical for about a morning. I was not skeptical after that. By the way, you need some customization on top of this :)

Bye bye WhisperFlow. The Chennai recommendation stood.

You noticed that almost everything I mentioned uses Mixture-of-Experts. That is not a coincidence. It’s worth understanding why this architecture has become the dominant (and cool) pattern in 2025 and 2026.

The classic neural network problem: to get smarter, you need more parameters. More parameters means higher memory usage, slower inference, and higher cost. Dense models, where every parameter activates for every token, scale poorly once you get into the hundreds of billions.

MoE breaks this by doing something more like what specialized teams do. Instead of one giant generalist brain, you train a collection of expert sub-networks, each specializing in different kinds of knowledge. When a token arrives, a routing mechanism, itself learned during training, decides which experts are most relevant and activates only those. For Kimi K2.6, this means one trillion total parameters but only 32 billion active per token.

Think of it this way. Imagine you have a startup with 200 employees. When a customer files a billing dispute, you don’t put all 200 people in the room. You route it to the two finance people and one customer success manager who are actually relevant. The other 197 people keep doing their own work. You get the collective knowledge of the full organization with the throughput of a small focused team.

That is roughly what MoE does, except the routing is learned from data and operates at the level of each token.
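Here is the same idea as a toy top-k router in plain Python. It is a sketch of the general technique, not any specific model's code: the "experts" are scalar functions, and the router logits are hard-coded where a real model would compute them from the token with learned weights.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Return {expert_index: weight} for the top_k experts, renormalized to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_layer(token, experts, router_logits, top_k=2):
    """Run only the selected experts and blend their outputs by router weight."""
    routing = route_token(router_logits, top_k)
    output = sum(weight * experts[i](token) for i, weight in routing.items())
    return output, routing


# Toy experts: each one is just a scalar function here
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]  # a real model computes these from the token
output, routing = moe_layer(3.0, experts, router_logits)
print(sorted(routing), output)
```

Experts 0 and 2 never run for this token. That skipped work is the finance-team analogy in code.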

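What this means for cost: you pay for inference on 32B active parameters, not 1T total, which is about 3% of the weights per token. The model learned from the full trillion-parameter training graph and carries that knowledge, but the per-token compute of serving it is roughly that of a 32B dense model (the full weights still have to sit in memory somewhere, which is the provider’s problem when you pay per token). You get frontier-level capability at mid-tier inference cost.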

If MoE excites you, read more about the core details here: [https://huggingface.co/blog/moe](https://huggingface.co/blog/moe) (read it if you’re deep into core AI/ML and maths, otherwise it will burn you)

There’s a conversation I’ve watched happen too many times in startup engineering rooms.

Product launches. Engineers are excited. They’ve integrated a closed model API and it’s working beautifully. Three months later, a new client asks: “Where does the data go when we call your AI?” And the room gets quiet because nobody actually read the data retention section of the API terms of service. Speaking from personal experience, never mind :P

This is not paranoia. It’s engineering responsibility.

When you send prompts to a closed model API, you are sending data to another company’s infrastructure under their terms. For most personal projects and consumer tools, this is fine. For production workloads with proprietary business logic, sensitive user data, or anything that touches financial information, healthcare, legal documents, or regulated industries, “we accepted the terms” is not an answer a serious engineering team should be comfortable with.

Open weights change this entirely. You can run on your own infrastructure. Your prompts never leave your environment. No model provider has access to your users’ data, your codebase, or your business logic. For a startup building in a regulated space, this capability alone can be worth paying more for.

Even if full self-hosting isn’t feasible, using open-weight models through smaller inference providers means you’re working with companies whose data handling practices you can actually scrutinize, audit, and ask direct questions about. The opaque data practices of the largest closed model providers are harder to interrogate.

I’m not saying don’t use closed models. I use them for specific things where the integration value justifies it. I’m saying make the decision deliberately, not by default.

Here is what my actual monthly spend looks like: GLM-5.2 via OpenCode Go Plan at $10. Kimi K2.6 API usage for agent workflows at a cost low enough that I’m tracking it as a rounding error. Parakeet running locally at whatever my laptop’s electricity costs. Nemotron and DeepSeek API usage for automation workflows.

Total: somewhere in the $10–15 range per month. Some months less.

A Codex subscription is $20. Claude Max is more. A team running both, which is what a lot of engineers are doing right now, is spending significantly more than that every month. Without necessarily getting significantly more output.

This is not a post about being cheap. If you’re at a company with real revenue and the subscription tools improve your team’s velocity enough to justify the cost, that’s a real calculation and it might come out in favor of the subscriptions.

But for the solo developer, the indie hacker, the early-stage startup with three engineers and a cloud bill that’s already too high, this gap is real money. The money you don’t spend on AI subscriptions can buy you another month of runway. Or better tooling. Or a good engineer who can review your architecture before you regret it.

Months ago I was very much in the “use whatever has the best benchmark and the most X (Twitter) engagement” camp. I’m not embarrassed about this. Everyone goes through it. The social proof is loud and the alternative requires you to actually experiment with models that most of the English-language tech press isn’t writing about.

What I’d tell myself then is this: the benchmark is a ceiling test in a controlled environment. Your production use case is a different exam entirely. Run the model on your actual task for a week before making a decision. Not on a toy example. On a real problem from your actual codebase with real edge cases.
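A week-long trial doesn't need heavy tooling. Here is a minimal sketch of a pass-rate harness, assuming you can wrap each candidate model as a function from prompt to answer. The stub models, the prompts, and the string checks are all placeholders; swap in real API calls and checks drawn from your own codebase.

```python
from typing import Callable

Check = Callable[[str], bool]


def pass_rate(model: Callable[[str], str], cases: list[tuple[str, Check]]) -> float:
    """Return the fraction of cases whose check passes on the model's answer."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


# Stub "models" so the sketch runs offline; replace with real API calls.
def model_a(prompt: str) -> str:
    return "use an idempotency key" if "duplicate" in prompt else "not sure"


def model_b(prompt: str) -> str:
    return "not sure"


cases = [
    ("Why does this webhook send duplicate emails?", lambda answer: "idempotency" in answer),
    ("How do I stop duplicate invoice rows?", lambda answer: "idempotency" in answer),
]

candidates = {"model_a": model_a, "model_b": model_b}
results = {name: pass_rate(fn, cases) for name, fn in candidates.items()}
print(results)
```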

The open-weights ecosystem in 2026 is not what it was in 2023. GLM-5.2 matching GPT-5.5 on real agentic tasks at one-sixth the cost is not a quirk; it’s the result of serious engineering talent in Chinese AI labs working on the same problems with access to serious compute. DeepSeek showing that frontier models don’t require nine-figure training budgets reset a lot of priors across the industry. Kimi shipping models that tie closed frontier benchmarks with full open weights and MIT licenses means the argument for paying subscription prices gets harder to make every quarter.

**Try new models** when they drop. Specifically, spend time with the ones that aren’t getting the most X (Twitter) attention, because the ones that are getting attention are also getting adopted by everyone else, which means the differentiation is gone. The interesting discoveries come from the models the generalist tech press hasn’t written fifteen threads about yet.

*Being a generalist is fine, dude :)*

And if you find a model that works exceptionally well for your use case, **contribute**. GitHub Sponsors, bug reports, eval contributions, writing about your experience. The labs/companies releasing open models are running on tighter margins than the closed model companies. The community is part of what makes the ecosystem stay open.

**TL;DR:** the best model for your problem is not the one with the highest score on a benchmark you will never run. It’s the one that does your actual task reliably, at a cost that makes sense for your situation, under terms you understand.

Everything else is marketing. Some of it is very well-produced marketing, but still.

**Choose your model like you choose your tech stack**, based on what you actually need to build, not based on what is getting the most applause on the internet this week.

Now if you’ll excuse me, I have a billing feature to ship with him.

[Choose Wisely: Models Should Follow Your Use Case.](https://pub.towardsai.net/choose-wisely-models-should-follow-your-use-case-9e1c420fbbf6) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.
