# Bringing Scientific Rigor to LLM Comparison

> Source: <https://dev.to/lavellehatcherjr/bringing-scientific-rigor-to-llm-comparison-5a6l>
> Published: 2026-05-31 09:52:16+00:00

*Note: This is a personal project, not affiliated with any company. This does not constitute financial or investment advice.*

Every time I wanted to compare two LLMs, I had to pick between a quick spot check in a chat window and spinning up an entire evaluation platform.

One tells you nothing useful.

The other takes longer to set up than the comparison is worth.

So I built a CLI that does it from the terminal. It's called **Cli Modelarium**, and it's live on PyPI today under Apache 2.0.

```
pip install cli-modelarium
```

In the rest of this post I'll walk through what it does, why I built it, what's actually under the hood, and how you can use it for your own LLM comparison work in under a minute.

The LLM tooling landscape has two ends.

On one end you have the chat-window spot check. You paste a prompt into Claude, then into GPT, then into Gemini, eyeball the outputs, and decide which one is "better." This is what most developers actually do. It feels productive. It produces nothing trustworthy.

The problem with spot checks is that LLM output has variance. You can run the same prompt twice and get different answers. You can also run the same prompt across two models, get answers that look similar, and miss the fact that one is hallucinating subtle facts. Eyeballing single outputs is not a comparison. It's a vibe.

On the other end you have enterprise evaluation platforms. These exist and they're powerful. They also require you to set up an account, configure an integration, define a dataset schema, write evaluators, plug in providers, and orchestrate runs through a dashboard. By the time you've finished onboarding, the question you wanted to answer has changed.

Most LLM comparison questions don't justify that overhead. You want to know: which model produces better outputs for this specific prompt at this specific cost. You don't want a dashboard. You want an answer.

That's the gap Cli Modelarium fills.

You install it. You set your provider keys. You run a comparison. You get statistically rigorous results in your terminal.

Here's a real example:

```
cli-modelarium "Explain quantum entanglement." \
  --models claude-haiku-4-5,gpt-4o-mini \
  --max-cost 0.10
```

That command sends the prompt to both models through their official APIs, tracks cost per call against your `--max-cost` cap so you don't accidentally spend more than 10 cents, measures time to first token and total latency for each model, and returns a side-by-side comparison with timing, cost, and full outputs.
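To make the two timing numbers concrete, here is a minimal sketch of measuring time to first token and total latency over a streamed response. It is not Cli Modelarium's code: `timed_stream` and `fake_stream` are hypothetical names, and the fake generator stands in for whatever streaming iterator a provider SDK returns.

```python
import time
from typing import Callable, Iterable, Iterator


def timed_stream(chunks: Iterable[str],
                 clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a stream of text chunks and record timing along the way."""
    start = clock()
    first_token_at = None
    parts = []
    for chunk in chunks:
        # The first non-empty chunk marks "time to first token".
        if first_token_at is None and chunk:
            first_token_at = clock()
        parts.append(chunk)
    end = clock()
    return {
        "text": "".join(parts),
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "total_s": end - start,
    }


def fake_stream() -> Iterator[str]:
    # Hypothetical stand-in for a provider's streaming response.
    for word in ["Entangled ", "particles ", "share ", "a ", "state."]:
        time.sleep(0.01)
        yield word


result = timed_stream(fake_stream())
print(f"ttft={result['ttft_s']:.3f}s total={result['total_s']:.3f}s")
```

Passing the clock in as a parameter keeps the function easy to test with a fake clock, which matters when the thing being measured is time itself.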

No infrastructure. No dashboard. No account onboarding. Just a CLI that returns an answer.

If you want statistical rigor on top of that, you add a few flags:

```
cli-modelarium "Explain quantum entanglement." \
  --models claude-haiku-4-5,gpt-4o-mini,gemini-2.0-flash \
  --runs 10 \
  --significance \
  --hallucination-check \
  --judge claude-opus-4 \
  --max-cost 1.00
```

Now you're running 10 trials against each of three models, computing bootstrap confidence intervals, running paired significance tests, checking outputs for hallucination patterns, and using a separate model as a judge to score quality, all while staying under a $1 cost cap.
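With three models there are three pairwise comparisons, so the raw p-values need adjusting before any of them is called significant. The sketch below shows the Holm step-down adjustment next to plain Bonferroni. The pair labels and p-values are made up for illustration, and the function names are hypothetical rather than anything exposed by the CLI.

```python
def bonferroni_adjust(p_values: dict[str, float]) -> dict[str, float]:
    """Multiply every p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return {name: min(1.0, m * p) for name, p in p_values.items()}


def holm_adjust(p_values: dict[str, float]) -> dict[str, float]:
    """Holm step-down adjusted p-values, keyed like the input."""
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    adjusted = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        # The smallest p-value is scaled by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p)
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[name] = running_max
    return adjusted


# Illustrative numbers only, not real benchmark results.
raw = {
    "claude-vs-gpt": 0.010,
    "claude-vs-gemini": 0.040,
    "gpt-vs-gemini": 0.030,
}
for pair, p in holm_adjust(raw).items():
    print(f"{pair}: holm={p:.3f} bonferroni={bonferroni_adjust(raw)[pair]:.3f}")
```

Holm is never more conservative than Bonferroni, which is why it is often the better default when only a handful of pairs are being compared.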

That's the gap I wanted to close. Publication-grade methodology, terminal-grade ergonomics.

The headline features are easy to list. The details are where the work lives.

**Multi-Provider Support.** Cli Modelarium supports 8 cloud LLM providers plus local models through a unified interface: OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Groq, and OpenRouter. Each provider has its own SDK, its own auth pattern, its own error semantics, its own rate-limit behavior. The interface hides all of that. You specify `--models claude-haiku-4-5,gpt-4o-mini` and the CLI figures out which provider to route each call to, handles credentials, and returns normalized outputs.

You only need API keys for the providers you actually want to use. You set them once with `cli-modelarium configure` or via environment variables.
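One simple way to picture the routing step is a prefix lookup on the model name. This is only a sketch under that assumption: the post doesn't say how the CLI actually resolves providers, and `PREFIX_TO_PROVIDER`, `route`, and `parse_models` are hypothetical names. The prefixes come from the model names used in the example commands.

```python
# Hypothetical prefix table; the real tool may resolve providers differently.
PREFIX_TO_PROVIDER = {
    "claude-": "anthropic",
    "gpt-": "openai",
    "gemini-": "google",
    "local:": "local",
}


def route(model: str) -> str:
    """Return the provider name for a model, based on its prefix."""
    for prefix, provider in PREFIX_TO_PROVIDER.items():
        if model.startswith(prefix):
            return provider
    raise ValueError(f"no provider known for model {model!r}")


def parse_models(arg: str) -> list[tuple[str, str]]:
    """Split a comma-separated --models value into (model, provider) pairs."""
    names = [name.strip() for name in arg.split(",") if name.strip()]
    return [(name, route(name)) for name in names]


print(parse_models("claude-haiku-4-5,gpt-4o-mini,local:llama-3.1-8b"))
```

Failing loudly on an unknown model name, rather than guessing a provider, keeps a typo from turning into a billed call against the wrong API.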

**Statistical Rigor.** This is where Cli Modelarium differs from every other LLM comparison tool I've used. LLM outputs have variance. To compare them rigorously you need actual statistics, not visual inspection. The CLI implements bootstrap confidence intervals using the BCa method, paired statistical tests including McNemar's test for binary outcomes, multiple comparison corrections including Bonferroni and Holm methods, and effect sizes using Cohen's d. These aren't decorative additions. They're the methods you'd use if you were writing a research paper comparing LLMs. The CLI just makes them invokable through flags.
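For readers who haven't met BCa (bias-corrected and accelerated) intervals, here is a compact standard-library sketch of the textbook procedure. It is an illustration, not the CLI's implementation: `bca_interval` is a hypothetical name, the scores are invented, and the quantile indexing is deliberately simple.

```python
import random
from statistics import NormalDist, mean


def bca_interval(data, stat=mean, n_boot=2000, alpha=0.05, seed=0):
    """Bias-corrected and accelerated bootstrap interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = stat(data)
    # Resample with replacement and sort the bootstrap statistics.
    boots = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))

    # Bias correction: where the point estimate sits in the bootstrap distribution.
    below = sum(b < theta_hat for b in boots)
    # Clamp so the inverse normal CDF never receives exactly 0 or 1.
    prop = min(max(below / n_boot, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    norm = NormalDist()
    z0 = norm.inv_cdf(prop)

    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = mean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den else 0.0

    def adjusted_quantile(q):
        # Shift the nominal quantile using the bias and acceleration terms.
        z = norm.inv_cdf(q)
        p = norm.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))
        idx = min(n_boot - 1, max(0, int(p * n_boot)))
        return boots[idx]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)


# Invented judge scores for ten runs of one model.
scores = [7.1, 6.8, 7.4, 6.9, 7.3, 7.0, 7.2, 6.7, 7.5, 7.1]
low, high = bca_interval(scores)
print(f"mean={mean(scores):.2f}, 95% BCa CI=({low:.2f}, {high:.2f})")
```

The fixed seed makes the interval reproducible between runs, which is worth having whenever a comparison result is going to be quoted to someone else.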

**Hallucination Detection.** When models generate plausible-sounding nonsense, statistical tests don't catch it. Hallucination detection runs additional checks on outputs to flag responses that contain markers of fabrication: invented citations, contradictory claims within the same response, fabricated names or dates, and other patterns that experienced reviewers learn to spot. It's not perfect. No hallucination detector is. But it surfaces high-risk outputs for human review, which is far better than flying blind.

**LLM-as-Judge Panels.** For subjective quality questions, you can use a separate model as a judge. The CLI supports panels with multiple judge models voting independently to reduce single-judge bias.
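The simplest way to combine independent judges is a majority vote with an explicit tie state. The sketch below assumes each judge returns the name of the candidate it preferred; the judge names and the `panel_verdict` function are hypothetical, and the CLI's own aggregation rule may differ.

```python
from collections import Counter


def panel_verdict(votes: dict[str, str]) -> dict:
    """votes maps judge name -> the candidate model that judge preferred."""
    if not votes:
        raise ValueError("a panel needs at least one vote")
    tally = Counter(votes.values())
    top_count = max(tally.values())
    leaders = sorted(model for model, count in tally.items() if count == top_count)
    return {
        # None signals a tie instead of silently picking a side.
        "winner": leaders[0] if len(leaders) == 1 else None,
        "tally": dict(tally),
        # Share of judges that agreed with the leading candidate.
        "agreement": top_count / len(votes),
    }


# Hypothetical judges and votes, for illustration only.
votes = {
    "judge-a": "claude-haiku-4-5",
    "judge-b": "gpt-4o-mini",
    "judge-c": "claude-haiku-4-5",
}
print(panel_verdict(votes))
```

Reporting the agreement share alongside the winner shows whether the panel was unanimous or split, which a bare verdict would hide.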

**Cost and Latency Tracking.** Every comparison tracks cost per call, total cost, time to first token, and total latency per model. The `--max-cost` flag enforces a hard cap. If your comparison would exceed the budget, the CLI stops before the next call and reports what it did.
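The "stop before the next call" behavior can be sketched as a check-then-spend loop. This is an assumption about the general pattern, not the CLI's code: `Budget` and `run_with_cap` are hypothetical, and the per-call cost estimates are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Budget:
    max_cost: float
    spent: float = 0.0
    log: list = field(default_factory=list)

    def can_afford(self, estimate: float) -> bool:
        # Check before the call, so the cap is never crossed.
        return self.spent + estimate <= self.max_cost

    def record(self, model: str, cost: float) -> None:
        self.spent += cost
        self.log.append((model, cost))


def run_with_cap(planned, budget):
    """planned is a list of (model, estimated_cost); returns the calls skipped."""
    for i, (model, estimate) in enumerate(planned):
        if not budget.can_afford(estimate):
            return planned[i:]  # stop here and report what was left undone
        # A real client would make the API call here and record the billed cost.
        budget.record(model, estimate)
    return []


# Invented per-call estimates, in dollars.
planned = [("claude-haiku-4-5", 0.04), ("gpt-4o-mini", 0.04), ("claude-haiku-4-5", 0.04)]
budget = Budget(max_cost=0.10)
skipped = run_with_cap(planned, budget)
print(f"spent ${budget.spent:.2f}, completed {len(budget.log)}, skipped {len(skipped)}")
```

Checking the estimate before each call is what makes the cap hard: the alternative, checking after the call, can only report an overrun once it has already happened.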

This is the part that usually gets skipped. I think it matters because it's where a CLI either earns trust or doesn't.

**917 Automated Tests.** Every commit runs 917 automated tests covering provider integration, statistical computation accuracy, CLI behavior, error handling, and edge cases. Zero CI failures since v0.1.0.

**9 OS/Python Combinations.** The CI matrix runs on Linux, macOS, and Windows across Python 3.11, 3.12, and 3.13. That's 9 combinations on every push. If something works on Linux but breaks on Windows, I know before users do.

**Statistical Validation Against Literature.** For the statistical methods, "passes tests" isn't enough. I cross-validated outputs against reference implementations: bootstrap CIs against SciPy's `scipy.stats.bootstrap` with the BCa method, McNemar's test against SciPy's `binomtest` for small samples and `chi2.sf` with the Edwards correction for larger samples, and effect sizes against published formulas for Cohen's d. When my implementation disagreed with the reference, I traced the discrepancy to its source and fixed it.
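The formulas being validated are short enough to write out with the standard library alone. The sketch below is independent of both SciPy and the CLI: the function names are hypothetical, `b` and `c` are the two discordant-pair counts in McNemar's 2x2 table, and Cohen's d uses the pooled sample standard deviation (an assumption, since the post doesn't say which variant the tool uses).

```python
import math
from statistics import mean, stdev


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact binomial p-value on the discordant pairs (small samples)."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def mcnemar_edwards(b: int, c: int) -> float:
    """Chi-square approximation with the Edwards continuity correction (1 df)."""
    n = b + c
    if n == 0:
        return 1.0
    stat = (abs(b - c) - 1) ** 2 / n
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2))


def cohens_d(a, b) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


print(mcnemar_exact(1, 9))          # 22/1024 = 0.021484375
print(mcnemar_edwards(10, 30))
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```

Having a dependency-free version like this alongside a SciPy call gives two independent answers to compare, which is the same cross-checking idea the paragraph above describes.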

**README in 9 Languages.** The README is available in 9 languages so developers across different regions can read about the project in their preferred language.

Every time I evaluate a new LLM for one of my projects, I run into the same problem: I want to know if Claude is better than GPT for this specific task, or if Gemini is fast enough for that other use case, or whether DeepSeek is worth the cost savings.

I'd open three browser tabs. I'd paste prompts. I'd squint at outputs. I'd make a call. Later I'd revisit the decision and realize I couldn't remember why I'd picked what I picked.

The CLI started as a script for my own evaluation work. Then I added statistical methods because I wanted to know if differences I was seeing were real or noise. Then I added cost tracking because I was burning through API credits. Then I added the test suite because I kept introducing regressions.

At some point I looked at the project and realized I'd built something other people would find useful, so I open sourced it under Apache 2.0.

The full workflow takes about 30 seconds.

**Install:**

```
pip install cli-modelarium
```

Works on Linux, macOS, and Windows. Python 3.11 or newer.

**Set your provider keys** (you only need keys for providers you'll actually use):

```
cli-modelarium configure
```

**Run your first comparison:**

```
cli-modelarium "Explain quantum entanglement." \
  --models claude-haiku-4-5,gpt-4o-mini \
  --max-cost 0.10
```

**With statistical rigor:**

```
cli-modelarium "Explain quantum entanglement." \
  --models claude-haiku-4-5,gpt-4o-mini,gemini-2.0-flash \
  --runs 10 \
  --significance \
  --hallucination-check \
  --output results.json \
  --max-cost 1.00
```

**Compare against a local model:**

```
cli-modelarium "Explain quantum entanglement." \
  --models claude-haiku-4-5,local:llama-3.1-8b \
  --max-cost 0.10
```

The launch version (v0.1.3) is feature-complete for the use cases I had when I built it. Additional language support is on my radar. The architecture is designed for it, so stay tuned.

If you use it and find something missing, open an issue.

```
pip install cli-modelarium
```

I built this because I needed it. I open sourced it because if I needed it, other people probably do too.

If you find it useful, a star on the repo helps surface it to other developers.
