Source: dev.to

Stop Shipping ML Models With Bare Floats: A Deep Dive Into Statistically Rigorous Model Evaluation

A developer built reliably-metrics, an open-source Python library that adds confidence intervals and statistical significance tests to common ML evaluation metrics like AUROC, ECE, and Brier score. The library automatically computes 95% confidence intervals and performs appropriate statistical tests (e.g., DeLong test for AUROC) to help teams make deployment decisions based on uncertainty-aware estimates rather than raw point estimates. It also supports calibration evaluation and recalibration, reliability diagrams with uncertainty bands, and generates HTML reports.

Published Jun 15, 2026 · 4 min read

Every week, somewhere, a team makes a deployment decision that looks like this:

Model A: AUROC = 0.847
Model B: AUROC = 0.851

They ship Model B.

Maybe it's better.

Maybe it's noise.

Nobody knows—because nobody computed a confidence interval.

That's exactly why I built reliably-metrics.

Most ML evaluation today looks like this:

print(f"AUROC = {auroc:.4f}")

Output:

AUROC = 0.8512

Looks precise.

Looks scientific.

But it tells you almost nothing about uncertainty.

Metrics are estimates computed from finite samples. Without uncertainty quantification, you're making decisions using a single point estimate and hoping it's representative.
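To make the point concrete, here is a minimal sketch of a percentile bootstrap interval for AUROC using only NumPy. It is not the library's implementation; the helper names `auroc` and `bootstrap_ci`, the synthetic data, and the choice of 1,000 resamples are all assumptions made for illustration.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample rows with replacement and recompute the metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUROC is undefined when a resample has only one class
        stats.append(auroc(y_true[idx], y_score[idx]))
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auroc(y_true, y_score), float(low), float(high)

# Synthetic labels and scores, for illustration only
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=500), 0.0, 1.0)

point, low, high = bootstrap_ci(y_true, y_score)
print(f"AUROC = {point:.4f} [{low:.4f}, {high:.4f}]")
```

The width of the printed interval is the part a bare float hides: with 500 samples it is typically several hundredths wide.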

Consider two models evaluated on 500 test samples:

Model A: AUROC = 0.847
Model B: AUROC = 0.851
Difference = +0.004

Is that improvement real?

Or would it disappear if you collected another batch of test data?

Most ML tooling doesn't answer that question.

reliably-metrics

pip install reliably-metrics

Basic evaluation:

import reliably as rb

report = rb.evaluate(y_true, y_prob)

print(report.summary())

Output:

Report(task=binary, n=500)
  ECE=0.0412 [0.0287, 0.0541]
  smECE=0.0389 [0.0261, 0.0523]
  Brier=0.1834 [0.1612, 0.2063]
  NLL=0.4821 [0.4503, 0.5148]
  AUROC=0.8234 [0.7941, 0.8509]

Notice something different?

Every metric comes with a 95% confidence interval.

No extra code.

No manual bootstrap implementation.

No statistics package required.

Instead of comparing raw metric values, compare uncertainty-aware estimates.

result = rb.compare(
    model_a,
    model_b,
    metric="auroc",
    y_true=y_true
)

print(f"Delta: {result.delta:+.4f}")
print(f"95% CI: [{result.ci.low:.4f}, {result.ci.high:.4f}]")
print(f"p-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")

Output:

Delta: +0.0182
95% CI: [-0.0031, 0.0396]
p-value: 0.0940
Significant: False

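Interpretation: the 95% confidence interval includes zero and the p-value is above 0.05, so the observed gain could plausibly be noise.

Translation: don't deploy Model B yet.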
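For intuition about where such a delta and interval can come from, here is a rough sketch of a paired bootstrap on the AUROC difference. Note the swap: the article names the DeLong test as the library's method for AUROC, while this sketch uses a paired bootstrap because it is simpler to show. The function name `paired_bootstrap_delta` and the synthetic scores are hypothetical.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def paired_bootstrap_delta(y_true, score_a, score_b, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap the AUROC difference (B minus A) on the same resampled rows."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    deltas = []
    while len(deltas) < n_boot:
        idx = rng.integers(0, n, size=n)  # same rows for both models: this is the pairing
        y = y_true[idx]
        if y.min() == y.max():
            continue  # skip resamples that contain a single class
        deltas.append(auroc(y, score_b[idx]) - auroc(y, score_a[idx]))
    low, high = (float(q) for q in np.quantile(deltas, [alpha / 2, 1 - alpha / 2]))
    delta = auroc(y_true, score_b) - auroc(y_true, score_a)
    significant = not (low <= 0.0 <= high)  # interval excludes zero
    return delta, low, high, significant

# Two synthetic models that share most of their signal, for illustration only
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
signal = y_true + rng.normal(0.0, 1.0, size=500)
score_a = signal + rng.normal(0.0, 0.5, size=500)
score_b = signal + rng.normal(0.0, 0.4, size=500)

delta, low, high, significant = paired_bootstrap_delta(y_true, score_a, score_b)
print(f"Delta: {delta:+.4f}")
print(f"95% CI: [{low:.4f}, {high:.4f}]")
print(f"Significant: {significant}")
```

Resampling the same rows for both models matters: their errors are correlated, and treating the two AUROCs as independent would overstate the uncertainty of the difference.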

The library automatically selects the appropriate test:

Metric                 Statistical method
AUROC                  DeLong test
Other metrics          Paired bootstrap
Multiple comparisons   Holm–Bonferroni correction
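The Holm–Bonferroni step is small enough to sketch in full. This is the standard step-down procedure written from scratch, not code taken from the library; the function name `holm_bonferroni` and the example p-values are illustrative.

```python
import numpy as np

def holm_bonferroni(p_values, alpha=0.05):
    """Return a boolean array: True where the hypothesis is rejected under Holm's rule."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)  # test the smallest p-value first
    reject = np.zeros(m, dtype=bool)
    for rank, i in enumerate(order):
        # The threshold loosens as hypotheses are rejected: alpha/m, alpha/(m-1), ...
        if p[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

p_values = [0.010, 0.040, 0.030, 0.005]
print(holm_bonferroni(p_values))  # [ True False False  True]
```

Here 0.030 and 0.040 would each pass an uncorrected 0.05 threshold, but neither survives the correction for four comparisons.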

A model can have excellent accuracy while being poorly calibrated.

If your model outputs:

predict_proba = 0.90

it should be correct approximately 90% of the time.

In practice, many production systems are far from this ideal.
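A common way to measure that gap is a binned Expected Calibration Error. The sketch below is one simple variant (equal-width bins, binary labels) and may differ from the definition the library uses; `expected_calibration_error` is a hypothetical helper name.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean of |observed positive rate - mean predicted probability| per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.digitize(y_prob, edges[1:-1])  # bin index in 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return float(ece)

# A model that says 0.90 ten times but is right only six times
y_prob = np.full(10, 0.90)
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
ece = expected_calibration_error(y_true, y_prob)
print(f"ECE = {ece:.2f}")  # ECE = 0.30
```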

report_before = rb.evaluate(
    y_true,
    y_prob
)

print(report_before["ECE"])

Output:

ECE=0.0821 [0.0612, 0.1034]

cal = rb.recalibrate(
    y_true,
    y_prob,
    method="temperature"
)

y_prob_cal = cal.predict(y_prob_test)

report_after = rb.evaluate(
    y_true_test,
    y_prob_cal
)

print(report_after["ECE"])

Output:

ECE=0.0241 [0.0143, 0.0352]
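To show what a "temperature" method typically does, here is a minimal sketch of temperature scaling for binary probabilities: convert probabilities to logits, divide by a scalar T, and pick the T that minimizes negative log-likelihood on a calibration split. The grid search, the helper names, and the synthetic overconfident model are assumptions for illustration and say nothing about how `rb.recalibrate` works internally.

```python
import numpy as np

def logit(p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p) - np.log1p(-p)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    p = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def fit_temperature(y_true, y_prob):
    """Grid-search the temperature T that minimizes NLL of sigmoid(logit(p) / T)."""
    grid = np.linspace(0.25, 5.0, 400)
    z = logit(y_prob)
    losses = [nll(y_true, sigmoid(z / t)) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic overconfident model: the reported logits are twice the true ones
rng = np.random.default_rng(0)
true_logit = rng.normal(0.0, 1.5, size=4000)
y = (rng.random(4000) < sigmoid(true_logit)).astype(float)
p_raw = sigmoid(2.0 * true_logit)

# Fit on the first half, evaluate on the held-out second half
t = fit_temperature(y[:2000], p_raw[:2000])
p_cal = sigmoid(logit(p_raw[2000:]) / t)

nll_before = nll(y[2000:], p_raw[2000:])
nll_after = nll(y[2000:], p_cal)
print(f"T = {t:.2f}")
print(f"NLL before = {nll_before:.4f}, after = {nll_after:.4f}")
```

Because the synthetic logits were doubled, the fitted temperature should land near 2, and dividing by it undoes the overconfidence without changing the ranking of predictions.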

Supported methods:

Most calibration plots show a line and leave interpretation to the reader.

reliably-metrics can visualize uncertainty directly.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 6))

report.reliability_diagram(
    y_true,
    y_prob,
    ax=ax,
    band=True
)

plt.savefig(
    "calibration.png",
    dpi=150
)

The shaded region represents a bootstrap confidence band around the calibration curve.

This helps distinguish real calibration errors from random fluctuations.
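A band like that can be approximated by bootstrapping the per-bin observed rates. The sketch below computes the numbers behind such a plot with NumPy only; `bin_rates` and `reliability_band` are hypothetical helpers, and the equal-width binning is an assumption rather than the library's documented behavior.

```python
import numpy as np

def bin_rates(y_true, y_prob, n_bins):
    """Observed positive rate in each equal-width probability bin (NaN if a bin is empty)."""
    bins = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins)
    positives = np.bincount(bins, weights=y_true, minlength=n_bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        return positives / counts

def reliability_band(y_true, y_prob, n_bins=10, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap band around the binned calibration curve."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    curves = np.empty((n_boot, n_bins))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        curves[b] = bin_rates(y_true[idx], y_prob[idx], n_bins)
    low, high = np.nanquantile(curves, [alpha / 2, 1 - alpha / 2], axis=0)
    return bin_rates(y_true, y_prob, n_bins), low, high

# Synthetic, well-calibrated predictions for illustration only
rng = np.random.default_rng(3)
y_prob = rng.random(1000)
y_true = (rng.random(1000) < y_prob).astype(float)

rates, low, high = reliability_band(y_true, y_prob)
for b in range(10):
    print(f"bin {b}: rate={rates[b]:.2f} band=[{low[b]:.2f}, {high[b]:.2f}]")
```

If the diagonal (predicted probability equals observed rate) stays inside the band for a bin, the deviation seen in that bin is consistent with sampling noise.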

Need a report for teammates or stakeholders?

report.to_html(
    path="model_report.html"
)

That's it.

The generated report contains:

No Jupyter notebook required.

Core installation:

pip install reliably-metrics

Visualization support:

pip install "reliably-metrics[viz]"

HTML reporting:

pip install "reliably-metrics[report]"

Everything:

pip install "reliably-metrics[all]"

Heavy dependencies are loaded only when needed.

Traditional bootstrap implementations often look like this:

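from sklearn.utils import resample  # assuming scikit-learn's resample helper

metrics = []
for _ in range(10_000):
    sample = resample(data)  # draw rows with replacement
    metrics.append(compute_metric(sample))  # compute_metric: your metric function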

That means 10,000 iterations of an interpreted Python loop, each with its own resampling and metric call.

reliably-metrics instead generates all bootstrap indices up front and performs calculations using vectorized NumPy operations.
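The idea can be sketched for the Brier score, which is a plain mean of per-sample squared errors and therefore vectorizes cleanly. This illustrates the general technique under that assumption; it is not the library's code, and rank-based metrics such as AUROC need more work than this.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_boot = 500, 10_000

# Synthetic predictions and labels, for illustration only
y_prob = rng.random(n)
y_true = (rng.random(n) < y_prob).astype(float)

# Per-sample squared errors; the Brier score is their mean
sq_err = (y_prob - y_true) ** 2

# All bootstrap indices at once: one row of n indices per replicate
idx = rng.integers(0, n, size=(n_boot, n))

# Fancy indexing gives an (n_boot, n) matrix; one mean per row, no Python loop
boot_brier = sq_err[idx].mean(axis=1)

low, high = np.quantile(boot_brier, [0.025, 0.975])
print(f"Brier = {sq_err.mean():.4f} [{low:.4f}, {high:.4f}]")
```

The trade-off is memory: the index matrix holds `n_boot * n` integers, so very large test sets may need the replicates processed in chunks.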

The result:

Every stochastic operation accepts an explicit seed.

report = rb.evaluate(
    y_true,
    y_prob,
    seed=42
)

Same data.

Same seed.

Same output.

Always.
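In NumPy terms, this kind of reproducibility usually comes from creating a local generator from the seed instead of relying on global random state. A small sketch, with `bootstrap_mean_ci` as a hypothetical stand-in for any stochastic routine:

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=2000, seed=None):
    """Percentile bootstrap interval for the mean, driven by an explicit seed."""
    rng = np.random.default_rng(seed)  # local generator, no global state touched
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    low, high = np.quantile(means, [0.025, 0.975])
    return float(low), float(high)

values = np.random.default_rng(7).normal(size=200)

first = bootstrap_mean_ci(values, seed=42)
second = bootstrap_mean_ci(values, seed=42)
print(first == second)  # True
```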

Many libraries claim statistical rigor.

We verify it.

The test suite repeatedly generates synthetic datasets with known ground-truth metrics and checks empirical confidence interval coverage.

If a nominal 95% confidence interval stops covering the true value approximately 95% of the time, the test suite fails.

Statistical correctness isn't just documentation—it's enforced in continuous integration.
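A coverage check of this kind can be sketched in a few lines: draw many synthetic datasets with a known true value, build an interval on each, and count how often the interval contains the truth. The setup below (Bernoulli data, a bootstrap interval for the mean, 500 trials, a 0.90 to 0.99 tolerance) is an illustrative assumption, not the project's actual test.

```python
import numpy as np

def bootstrap_mean_ci(values, rng, n_boot=1000, alpha=0.05):
    """Percentile bootstrap interval for the mean of `values`."""
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def empirical_coverage(true_rate=0.3, n=300, n_trials=500, seed=0):
    """Fraction of synthetic datasets whose interval contains the known true rate."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        sample = (rng.random(n) < true_rate).astype(float)
        low, high = bootstrap_mean_ci(sample, rng)
        hits += int(low <= true_rate <= high)
    return hits / n_trials

coverage = empirical_coverage()
print(f"empirical coverage = {coverage:.3f}")

# The tolerance allows for Monte Carlo noise around the nominal 95%
assert 0.90 <= coverage <= 0.99, "interval coverage drifted from nominal"
```

The tolerance has to be wide enough to absorb simulation noise, otherwise the check becomes flaky; more trials allow a tighter band at the cost of a slower test.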

If you're working on representation learning, the library also includes disentanglement evaluation metrics.

from reliably.repr import disentanglement

results = disentanglement(
    z,
    factors,
    metrics=(
        "mig",
        "sap",
        "dci",
        "factorvae",
        "irs"
    )
)

print(results["mig"])

Output:

MIG=0.312 [0.271, 0.354]

Included metrics: MIG, SAP, DCI, FactorVAE, and IRS, all reported with bootstrap confidence intervals.

The project is still in its early stages, and contributions are welcome.

GitHub: https://github.com/nischal1234/reliably

Documentation: https://reliably.readthedocs.io

PyPI: pip install reliably-metrics

Machine learning has become incredibly good at reporting tiny metric improvements.

We're much worse at determining whether those improvements are actually real.

A model with:

AUROC = 0.851

isn't enough.

What you really need is:

AUROC = 0.851 [0.812, 0.887]

Because uncertainty isn't optional.

It's part of the measurement.

Let's make statistically rigorous ML evaluation the default—not the exception.
