Quick release note. Cli-Modelarium 0.1.4 just shipped, and the headline is two new providers.
You can now compare Alibaba's Qwen models (via DashScope) and Z.AI's GLM models side by side with the rest of the lineup: OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Groq, OpenRouter, plus your local models. That brings it to 10 cloud providers.
If you have wanted to benchmark the open-weight models against the frontier ones on your own prompts, it now takes one upgrade and one command:
pip install --upgrade cli-modelarium
cli-modelarium "Write a haiku about garbage collection in programming" \
--models qwen3.7-max,glm-5.2,gpt-5.4,claude-opus-4-8 \
--runs 10 --max-cost 0.50
You get a side-by-side table with cost and latency per model. With --runs greater than 1 it repeats the trials and runs the statistical tests automatically, so you can tell a real difference from noise instead of eyeballing one output. The --max-cost flag is a hard cap, so a multi-model run does not surprise your API bill.
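To make "a real difference from noise" concrete, here is a minimal sketch of one common approach: a paired percentile bootstrap on per-run latencies. This is not Cli-Modelarium's own code, and the tool may compute its intervals differently. The function name paired_bootstrap_ci and the latency numbers are made up for illustration.

```python
import random
import statistics


def paired_bootstrap_ci(a, b, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of the paired differences a[i] - b[i].

    Hypothetical helper for illustration; not part of Cli-Modelarium.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty samples")
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi


# Made-up latencies in seconds for two models answering the same prompt 10 times.
model_a = [1.9, 2.1, 2.4, 2.0, 2.2, 1.8, 2.3, 2.1, 2.0, 2.2]
model_b = [1.2, 1.4, 1.3, 1.5, 1.1, 1.3, 1.4, 1.2, 1.3, 1.4]

mean_diff, lo, hi = paired_bootstrap_ci(model_a, model_b)
print(f"mean latency difference: {mean_diff:.2f}s, 95% CI [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes zero, the latency gap is unlikely to be run-to-run noise. With a single run there are no paired differences to resample, which is why this kind of check only makes sense when --runs is greater than 1.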
Cli-Modelarium is a command-line tool for comparing LLM outputs side by side, with real statistics (bootstrap confidence intervals, paired significance tests, McNemar's test), CI-ready assertions, hallucination detection, LLM-as-judge scoring, and cost tracking. One pip install, no infrastructure, Apache 2.0.
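For anyone who has not met McNemar's test: it compares two models on the same items when the outcome is pass/fail, and it only looks at the items where the two models disagree. The sketch below is a standard exact (binomial) version written for illustration. It is an assumption about the general technique, not a copy of what Cli-Modelarium runs, and the name mcnemar_exact and the counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: items model A passed and model B failed
    c: items model A failed and model B passed
    Hypothetical helper for illustration; not part of Cli-Modelarium.
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    k = min(b, c)
    # Under the null hypothesis each disagreement is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Made-up counts: A wins 8 of the disagreements, B wins 1.
p_value = mcnemar_exact(8, 1)
print(f"p = {p_value:.4f}")
```

Items where both models pass or both fail do not enter the calculation, so a large test set with few disagreements can still give an inconclusive result.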
Would love to hear how the new providers work for your use case.