{"slug": "comparing-open-weight-ai-models-and-providers", "title": "Comparing open weight AI models and providers", "summary": "Wagtail developer Thibaud Colas published a guide comparing open weight AI models and inference providers, emphasizing transparency and decoupled infrastructure as key advantages. The post outlines criteria for selecting models, including license, context window, capabilities, benchmark performance, and energy use, with tools like WebDev Arena and Artificial Analysis for comparison.", "body_md": "# Comparing open weight AI models and providers\n\n## Why and how we select open weight LLMs\n\nIt’s been a few months since we first reported on [our AI usage when working on Wagtail](/blog/open-source-ai-we-use-to-work-on-wagtail/), sharing open weight models we’re working with, as well as AI inference providers. With AI adoption levels so high in our [DX with AI survey](/blog/2026-ai-dx-survey/), here’s a \"part two\", focusing on **how to select open weight models and inference providers**.\n\nNote: this post focuses primarily on agentic SWE / development tasks, with an emphasis on coding in particular. There might be other more important criteria for other kinds of AI use!\n\n## Why we prefer open weight\n\nThere’s no “silver bullet” LLM or provider that meets all our criteria. Most fall well short of [what we value](https://wagtail.org/blog/ai-in-the-cms-steering-the-ecosystem/) when it comes to AI adoption. But open weight models have two key advantages that make them work out much better in practice:\n\n**Much greater transparency**. Since the model weights and model cards are provided for anyone and everyone to see, you know much more about how the models were trained and subsequently run. For example, you can easily estimate the energy use of AI inference.\n\n**Decoupled model providers and inference infrastructure providers**. This leads to market dynamics that are more favorable to inference consumers like us. 
Infrastructure companies are incentivized to compete on more efficient servers and lower prices.\n\nMy favorite example of the above two in practice is [Neuralwatt](https://portal.neuralwatt.com/)’s energy use dashboard, which shows exactly how much energy I’ve used while coding with Kimi K2.6.\n\nThose two advantages do result in much more work when comparing models / providers: we have many more options, and more information about them. So let’s go through our criteria!\n\n## Criteria for models\n\nWhen reviewing inference options, it’s natural to be drawn to comparing models first, as their capability scores are what we all keep track of on benchmarks. Here’s what I look out for personally, as pass-fail criteria:\n\n**Model license**: is it open weight or open source? Are there specific restrictions? There are very few truly open models ([Apertus](https://apertvs.ai/), [Olmo 3 from Allen Institute](https://allenai.org/olmo)).\n\n**Large context window length**. A good indication of the ease of use of the model. Anything above 200k tokens tends to work well enough for any task I might want to throw at it.\n\n**Input-output capabilities**. Support for tool use, image input, JSON schema.\n\nAnd the criteria that aren’t pass-fail on top, to rank models:\n\n**Position on the *pareto frontier* in benchmarks**. A fancy term for \"capability per cost\". A model is considered \"pareto optimal\" if it’s currently the best at a given price point.\n\n**Energy use per request**. 
A great proxy for the long-term cost of using the model, one that isn’t influenced by any one provider’s willingness to subsidize your usage or any other aspect of their market adoption strategy.\n\nThe [WebDev Arena AI leaderboard](https://arena.ai/leaderboard/code/webdev/pareto) has a built-in pareto frontier visualization that you can combine with filtering to only view open weight models.\n\nArtificial Analysis also provide [excellent filterable visualizations](https://artificialanalysis.ai/?intelligence=coding-index&cost=intelligence-vs-cost-per-task&model-filters=open-source&models=gpt-oss-20b%2Cgpt-oss-20b-low%2Cgpt-oss-120b-low%2Cgpt-oss-120b%2Cllama-4-maverick%2Cgemma-4-31b-non-reasoning%2Cgemma-4-26b-a4b%2Cgemma-4-31b%2Cmistral-medium-3-5%2Cdeepseek-v4-flash%2Cdeepseek-v4-pro%2Cminimax-m3%2Cnvidia-nemotron-3-ultra-550b-a55b%2Ckimi-k2-6%2Ckimi-k2-6-non-reasoning%2Ckimi-k2-7-code%2Cmimo-v2-5-pro%2Ck2-think-v2%2Capertus-70b-instruct%2Cglm-5-2%2Cqwen3-5-397b-a17b) of similar information, though they take more effort to grasp.\n\nThis filtering combines all of our criteria for models, except energy use. We need other datasets for that. The [French government’s compar:IA models leaderboard](https://comparia.beta.gouv.fr/ranking) has data for a wide range of models. Here are notable highlights when we look at both performance and energy use:\n\n[Gemma 4 models](https://deepmind.google/models/gemma/gemma-4/) from Google DeepMind score very high for such efficient models.\n\n[GLM 5.2](https://z.ai/blog/glm-5.2) from z.ai and [Kimi](https://www.kimi.com/ai-models/kimi-k2-6) from Moonshot rank higher in performance but also much higher in energy use.\n\n[DeepSeek V4 Pro](https://api-docs.deepseek.com/) is a step higher still in energy use, without necessarily being more performant (benchmark-dependent).\n\nAbstract model performance and efficiency numbers are great, but when it comes to coding usage you’ll also want to see how models actually run with providers. 
So let’s look at that next!\n\n## Criteria for providers\n\nThere are a lot of options here too, and the ecosystem is moving fast as more and more AI users want more control and agency over how and where their usage is served. Here are the key criteria we consider, pass-fail:\n\n**Sovereignty**. Broadly speaking, knowing which states and companies have control over the provider. For example, when looking at privacy / data protection, you want to know where the data is stored and processed, but also by which entities and within what legal frameworks they operate.\n\n**Model breadth**. Having 3+ model families that are (roughly) around the pareto frontier in their recent iterations.\n\n**Model availability lifecycle**. As new models and model families get released all the time, it’s really important to know how well the provider can handle this. The server requirements of flagship AI models are enormous, so providers have to decide carefully which models to introduce in their service and also when to retire previous models.\n\nAnd requirements that allow us to better rank providers afterwards:\n\n**Environmental sustainability**. We want to know how much energy the LLM inference takes, as well as overall data center efficiency and the carbon footprint of the electricity source (local grid or otherwise).\n\n**Cache hit rates**. This is [crucial to the cost of agentic work](https://sankalp.bearblog.dev/how-prompt-caching-works/), and lots of providers [aren’t doing too well on this](https://dirac.run/posts/cache-hit-rates-agents). Some don’t even provide separate pricing for cache access, which is a red flag.\n\nWe chose Scaleway’s Generative APIs service back in 2025 to support our AI adoption while meeting our [AI guiding principles](/blog/ai-in-the-cms-steering-the-ecosystem/), but these days we’re back to looking at a lot of alternatives. Here are the most promising options for us:\n\n[Scaleway](https://www.scaleway.com/en/generative-apis/). 
They fare excellently on model lifecycle ([clearest policy](https://www.scaleway.com/en/docs/generative-apis/reference-content/model-lifecycle/)) and sovereignty (EU company, EU data centers) but their cache support and model breadth are too limited.\n\n[Neuralwatt](https://portal.neuralwatt.com/models). Their energy transparency and pricing model are unmatched.\n\n[TensorX](https://tensorx.ai/models/). EU company working with US (Ireland) and EU (Finland) data center providers. Great model breadth but the lifecycle is unclear.\n\n[Argyll Data](https://argylldev.com/models). A new UK-based provider with plans to build their own data center using renewable energy in Scotland.\n\n## Where next\n\nWe hope this information helps with transparency, and encourages more people to give open weight models and providers a go! There’s a lot to navigate but it really doesn’t take that much time to get set up with a provider running a flagship Chinese model, at Opus-comparable performance but 10x cheaper. Or pick Gemma 4 for an even lower footprint.\n\nFor Wagtail, we’d love to demonstrate how to compare the output of multiple models on given agentic tasks in the future. Perhaps a [Wagtail upgrade skill](/blog/an-agent-skill-to-upgrade-your-wagtail-site/) benchmark? 
That’d make for a great [Wagtail Space 2026](/wagtail-space-2026/) talk proposal 🤫", "url": "https://wpnews.pro/news/comparing-open-weight-ai-models-and-providers", "canonical_source": "https://wagtail.org/blog/comparing-open-weight-ai-models-and-providers/", "published_at": "2026-07-01 12:40:35+00:00", "updated_at": "2026-07-01 12:50:41.283231+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure", "developer-tools"], "entities": ["Wagtail", "Thibaud Colas", "Neuralwatt", "Kimi K2.6", "Apertus", "Allen Institute", "WebDev Arena", "Artificial Analysis"], "alternates": {"html": "https://wpnews.pro/news/comparing-open-weight-ai-models-and-providers", "markdown": "https://wpnews.pro/news/comparing-open-weight-ai-models-and-providers.md", "text": "https://wpnews.pro/news/comparing-open-weight-ai-models-and-providers.txt", "jsonld": "https://wpnews.pro/news/comparing-open-weight-ai-models-and-providers.jsonld"}}