Stop Shipping ML Models With Bare Floats: A Deep Dive Into Statistically Rigorous Model Evaluation


Source: dev.to · Published: 2026-06-15T19:44Z

A developer built reliably-metrics, an open-source Python library that adds confidence intervals and statistical significance tests to common ML evaluation metrics like AUROC, ECE, and Brier score. The library automatically computes 95% confidence intervals and performs appropriate statistical tests (e.g., DeLong test for AUROC) to help teams make deployment decisions based on uncertainty-aware estimates rather than raw point estimates. It also supports calibration evaluation and recalibration, reliability diagrams with uncertainty bands, and generates HTML reports.

Every week, somewhere, a team makes a deployment decision that looks like this:

Model A: AUROC = 0.847
Model B: AUROC = 0.851

They ship Model B.

Maybe it's better.

Maybe it's noise.

Nobody knows—because nobody computed a confidence interval.

That's exactly why I built reliably-metrics.

Most ML evaluation today looks like this:

print(f"AUROC = {auroc:.4f}")

Output:

AUROC = 0.8512

Looks precise.

Looks scientific.

But it tells you almost nothing about uncertainty.

Metrics are estimates computed from finite samples. Without uncertainty quantification, you're making decisions using a single point estimate and hoping it's representative.
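To make "estimates computed from finite samples" concrete, here is a minimal sketch of a percentile bootstrap confidence interval for AUROC using only NumPy. It is not the reliably-metrics implementation; the helper names `auroc` and `bootstrap_auroc_ci` and the synthetic data are assumptions made up for illustration.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)

def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample rows with replacement, recompute AUROC."""
    rng = np.random.default_rng(seed)
    n = y_true.size
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        yt = y_true[idx]
        if yt.min() == yt.max():                # one class only: AUROC undefined
            continue
        stats.append(auroc(yt, y_score[idx]))
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auroc(y_true, y_score), low, high

# Synthetic data, made up for illustration: 500 labels with noisy scores.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
y_score = 1.4 * y_true + rng.normal(0.0, 1.0, size=500)

point, low, high = bootstrap_auroc_ci(y_true, y_score)
print(f"AUROC = {point:.4f} [{low:.4f}, {high:.4f}]")
```

The width of the printed interval is the part a bare float hides: with a few hundred samples it is typically much larger than a third-decimal difference between two models.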

Consider two models evaluated on 500 test samples:

Model A: AUROC = 0.847
Model B: AUROC = 0.851
Difference = +0.004

Is that improvement real?

Or would it disappear if you collected another batch of test data?

Most ML tooling doesn't answer that question.

Install reliably-metrics:

pip install reliably-metrics

Basic evaluation:

import reliably as rb

report = rb.evaluate(y_true, y_prob)

print(report.summary())

Output:

Report(task=binary, n=500)
  ECE=0.0412 [0.0287, 0.0541]
  smECE=0.0389 [0.0261, 0.0523]
  Brier=0.1834 [0.1612, 0.2063]
  NLL=0.4821 [0.4503, 0.5148]
  AUROC=0.8234 [0.7941, 0.8509]

Notice something different?

Every metric comes with a 95% confidence interval.

No extra code.

No manual bootstrap implementation.

No statistics package required.

Instead of comparing raw metric values, compare uncertainty-aware estimates.

result = rb.compare(
    model_a,
    model_b,
    metric="auroc",
    y_true=y_true
)

print(f"Delta: {result.delta:+.4f}")
print(f"95% CI: [{result.ci.low:.4f}, {result.ci.high:.4f}]")
print(f"p-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")

Output:

Delta: +0.0182
95% CI: [-0.0031, 0.0396]
p-value: 0.094
Significant: False

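Interpretation: the 95% confidence interval includes zero and the p-value (0.094) is above 0.05, so the observed difference is not statistically significant.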

Translation:

Don't deploy Model B yet.
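For intuition about what a paired comparison does, the sketch below bootstraps the difference in Brier score between two models while resampling the same rows for both. It is an illustration under stated assumptions, not the library's code: `paired_bootstrap_delta` is a hypothetical helper, the predictions are synthetic, and "significant" here simply means the 95% interval excludes zero.

```python
import numpy as np

def paired_bootstrap_delta(y_true, prob_a, prob_b, n_boot=2000, alpha=0.05, seed=0):
    """CI for Brier(B) - Brier(A), resampling the same rows for both models."""
    rng = np.random.default_rng(seed)
    n = y_true.size
    # Per-sample difference in squared error; its mean is the Brier difference.
    d = (prob_b - y_true) ** 2 - (prob_a - y_true) ** 2
    idx = rng.integers(0, n, size=(n_boot, n))  # one row of indices per replicate
    deltas = d[idx].mean(axis=1)
    low, high = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    significant = bool(low > 0 or high < 0)     # interval excludes zero
    return d.mean(), low, high, significant

# Synthetic predictions, made up for illustration.
rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=500)
signal = 0.5 + 0.25 * (2 * y_true - 1)
prob_a = np.clip(signal + rng.normal(0, 0.20, size=500), 0, 1)
prob_b = np.clip(signal + rng.normal(0, 0.19, size=500), 0, 1)

delta, low, high, significant = paired_bootstrap_delta(y_true, prob_a, prob_b)
print(f"Delta Brier: {delta:+.4f}  95% CI: [{low:+.4f}, {high:+.4f}]  significant: {significant}")
```

Pairing matters because both models are scored on the same examples: resampling rows jointly keeps their errors correlated, which usually gives a tighter interval for the difference than two independent intervals would suggest. Lower Brier is better, so a negative delta favors Model B.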

The library automatically selects the appropriate test:

Metric	Statistical Method
AUROC	DeLong Test
Other Metrics	Paired Bootstrap
Multiple Comparisons	Holm–Bonferroni Correction
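The Holm–Bonferroni row is easy to sketch in plain Python. The function below is a generic implementation of the step-down procedure, not code taken from reliably-metrics, and the p-values are made up for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where Holm's procedure rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p-value first
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: every larger p-value is retained as well
    return reject

p_values = [0.010, 0.040, 0.030, 0.005]
print(holm_bonferroni(p_values))  # → [True, False, False, True]
```

All four p-values are below 0.05, so an uncorrected comparison would call every one of them significant. After correction only two survive, which is the point of adjusting when several models or metrics are compared at once.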

A model can have excellent accuracy while being poorly calibrated.

If your model outputs:

predict_proba = 0.90

it should be correct approximately 90% of the time.

In practice, many production systems are far from this ideal.
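One common way to quantify that gap is a binned expected calibration error (ECE). The sketch below assumes equal-width bins over the predicted positive-class probability; the library's exact binning and estimator may differ, and the function name and synthetic data are illustrative only.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean of |observed positive rate - mean predicted probability| per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])   # bin index 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap             # weight by the bin's share of samples
    return float(ece)

# Synthetic data: labels drawn from known probabilities.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.02, 0.98, size=5000)
y_true = (rng.uniform(size=5000) < p_true).astype(int)

# An overconfident model: same ranking, but logits stretched by a factor of 3.
logit = np.log(p_true / (1 - p_true))
p_overconfident = 1 / (1 + np.exp(-3 * logit))

ece_calibrated = expected_calibration_error(y_true, p_true)
ece_overconfident = expected_calibration_error(y_true, p_overconfident)
print(f"ECE calibrated:    {ece_calibrated:.4f}")
print(f"ECE overconfident: {ece_overconfident:.4f}")
```

Both sets of predictions rank the examples identically, so their AUROC is the same; only the calibration metric tells them apart.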

report_before = rb.evaluate(
    y_true,
    y_prob
)

print(report_before["ECE"])

Output:

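ECE=0.0821 [0.0612, 0.1034]

# Fit the recalibrator on one set of labels and probabilities
cal = rb.recalibrate(
    y_true,
    y_prob,
    method="temperature"
)

# Apply it to held-out test probabilities and re-evaluate
y_prob_cal = cal.predict(y_prob_test)

report_after = rb.evaluate(
    y_true_test,
    y_prob_cal
)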

print(report_after["ECE"])

Output:

ECE=0.0241 [0.0143, 0.0352]
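As a rough picture of what temperature scaling does, the sketch below fits a single temperature T by grid search so that sigmoid(logit(p) / T) minimizes negative log-likelihood on a calibration split. This is an assumption-laden illustration, not how `rb.recalibrate` is implemented: temperature scaling is usually defined on logits, and here the logits are recovered from binary probabilities. All helper names and data are made up.

```python
import numpy as np

def logit(p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p) - np.log1p(-p)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(y_true, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log1p(-p)))

def fit_temperature(y_true, y_prob, grid=None):
    """Pick the temperature T that minimizes NLL of sigmoid(logit(p) / T)."""
    grid = np.geomspace(0.05, 20.0, 400) if grid is None else grid
    z = logit(y_prob)
    losses = [nll(y_true, sigmoid(z / t)) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic data: the true logit is z, but the model reports sigmoid(3 * z).
rng = np.random.default_rng(0)
z_true = rng.normal(0.0, 1.0, size=4000)
y = (rng.uniform(size=4000) < sigmoid(z_true)).astype(int)
p_model = sigmoid(3.0 * z_true)

# Fit on the first half, evaluate on the held-out second half.
t = fit_temperature(y[:2000], p_model[:2000])
p_cal = sigmoid(logit(p_model[2000:]) / t)
nll_before = nll(y[2000:], p_model[2000:])
nll_after = nll(y[2000:], p_cal)
print(f"T = {t:.2f}")
print(f"NLL before: {nll_before:.4f}  after: {nll_after:.4f}")
```

Because dividing logits by a positive constant never changes their order, this kind of recalibration leaves ranking metrics such as AUROC untouched while moving ECE and NLL.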

Supported methods include temperature scaling (method="temperature"), used above.

Most calibration plots show a line and leave interpretation to the reader.

reliably-metrics can visualize uncertainty directly.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 6))

report.reliability_diagram(
    y_true,
    y_prob,
    ax=ax,
    band=True
)

plt.savefig(
    "calibration.png",
    dpi=150
)

The shaded region represents a bootstrap confidence band around the calibration curve.

This helps distinguish real calibration errors from random fluctuations.
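A band like this can be approximated by recomputing the binned calibration curve on bootstrap resamples and taking pointwise percentiles. The sketch below shows that idea with NumPy only; `reliability_band` is a hypothetical helper, and the library's own band construction may differ.

```python
import numpy as np

def reliability_band(y_true, y_prob, n_bins=10, n_boot=500, seed=0):
    """Pointwise 95% bootstrap band for the observed positive rate in each bin."""
    rng = np.random.default_rng(seed)
    n = y_true.size
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    curves = np.full((n_boot, n_bins), np.nan)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        bin_ids = np.digitize(y_prob[idx], edges[1:-1])
        counts = np.bincount(bin_ids, minlength=n_bins)
        hits = np.bincount(bin_ids, weights=y_true[idx], minlength=n_bins)
        nonempty = counts > 0                    # leave empty bins as NaN
        curves[b, nonempty] = hits[nonempty] / counts[nonempty]
    low, high = np.nanquantile(curves, [0.025, 0.975], axis=0)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, low, high

# Synthetic, well-calibrated predictions for illustration.
rng = np.random.default_rng(3)
y_prob = rng.uniform(0.0, 1.0, size=2000)
y_true = (rng.uniform(size=2000) < y_prob).astype(int)

centers, low, high = reliability_band(y_true, y_prob)
for c, lo, hi in zip(centers, low, high):
    print(f"bin {c:.2f}: observed rate in [{lo:.3f}, {hi:.3f}]")
```

If the diagonal (predicted probability equals observed rate) falls inside the band for a bin, the deviation seen in that bin is compatible with sampling noise.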

Need a report for teammates or stakeholders?

report.to_html(
    path="model_report.html"
)

That's it.

The generated report contains:

No Jupyter notebook required.

Core installation:

pip install reliably-metrics

Visualization support:

pip install reliably-metrics[viz]

HTML reporting:

pip install reliably-metrics[report]

Everything:

pip install reliably-metrics[all]

Heavy dependencies are loaded only when needed.

Traditional bootstrap implementations often look like this:

for i in range(10000):
    sample = resample(data)
    metric = compute_metric(sample)

That means 10,000 iterations of an interpreted Python loop, each recomputing the metric from scratch.

reliably-metrics instead generates all bootstrap indices up front and performs calculations using vectorized NumPy operations.
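The index-matrix idea can be sketched for any metric that is a mean of per-sample losses, such as the Brier score. The example below is an illustration rather than the library's code; it uses 2,000 replicates instead of 10,000 to keep memory modest, and checks that the loop and the vectorized version agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_boot = 500, 2000

# Synthetic labels and probabilities, made up for illustration.
y_true = rng.integers(0, 2, size=n)
y_prob = np.clip(0.5 + 0.3 * (2 * y_true - 1) + rng.normal(0, 0.2, size=n), 0, 1)
sq_err = (y_prob - y_true) ** 2                  # per-sample Brier contributions

# All bootstrap indices generated up front: one row per replicate.
idx = rng.integers(0, n, size=(n_boot, n))

# Loop version: one Python iteration per replicate.
loop_means = np.array([sq_err[idx[b]].mean() for b in range(n_boot)])

# Vectorized version: a single fancy-indexing call and one reduction.
vec_means = sq_err[idx].mean(axis=1)

low, high = np.quantile(vec_means, [0.025, 0.975])
print(f"Brier = {sq_err.mean():.4f} [{low:.4f}, {high:.4f}]")
print("loop and vectorized agree:", np.allclose(loop_means, vec_means))
```

The trade-off is memory: the index matrix has n_boot × n entries, so very large test sets may need the replicates processed in chunks.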

The result is far less interpreter overhead per bootstrap replicate.

Every stochastic operation accepts an explicit seed.

report = rb.evaluate(
    y_true,
    y_prob,
    seed=42
)

Same data.

Same seed.

Same output.

Always.
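In NumPy, this kind of reproducibility usually comes from passing the seed to a local generator instead of relying on global random state. A minimal sketch, with a hypothetical helper name and made-up data:

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=1000, seed=None):
    """95% percentile bootstrap CI for the mean, driven by a local generator."""
    rng = np.random.default_rng(seed)            # no global state is touched
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    means = values[idx].mean(axis=1)
    low, high = np.quantile(means, [0.025, 0.975])
    return float(low), float(high)

values = np.random.default_rng(1).normal(size=200)

first = bootstrap_mean_ci(values, seed=42)
second = bootstrap_mean_ci(values, seed=42)
print(first == second)  # → True
```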

Many libraries claim statistical rigor.

We verify it.

The test suite repeatedly generates synthetic datasets with known ground-truth metrics and checks empirical confidence interval coverage.

If a nominal 95% confidence interval stops covering the true value approximately 95% of the time, the continuous-integration tests fail.

Statistical correctness isn't just documentation—it's enforced in continuous integration.
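A coverage check of this kind can be sketched as a small simulation: draw many datasets from a distribution whose true metric is known, build an interval on each, and count how often the interval contains the truth. The example below does this for accuracy with a percentile bootstrap. The function names, trial counts, and the 0.90 to 0.99 tolerance are assumptions chosen for illustration, not the project's actual test configuration.

```python
import numpy as np

def bootstrap_ci_mean(values, rng, n_boot=500, alpha=0.05):
    """Percentile bootstrap CI for the mean of `values`."""
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    means = values[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def empirical_coverage(true_acc=0.7, n=300, n_trials=500, seed=0):
    """Fraction of simulated datasets whose CI contains the known true accuracy."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        # Each entry is 1.0 if a hypothetical prediction was correct, else 0.0.
        correct = (rng.uniform(size=n) < true_acc).astype(float)
        low, high = bootstrap_ci_mean(correct, rng)
        hits += int(low <= true_acc <= high)
    return hits / n_trials

coverage = empirical_coverage()
print(f"Empirical coverage of nominal 95% CI: {coverage:.3f}")

# Tolerance band around the nominal level; a failure here would fail the build.
assert 0.90 <= coverage <= 0.99, "CI coverage drifted from nominal"
```

The tolerance has to allow for Monte Carlo noise in the coverage estimate itself, which shrinks as the number of simulated datasets grows.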

If you're working on representation learning, the library also includes disentanglement evaluation metrics.

from reliably.repr import disentanglement

results = disentanglement(
    z,
    factors,
    metrics=(
        "mig",
        "sap",
        "dci",
        "factorvae",
        "irs"
    )
)

print(results["mig"])

Output:

MIG=0.312 [0.271, 0.354]

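Included metrics:

MIG (Mutual Information Gap)
SAP (Separated Attribute Predictability)
DCI (Disentanglement, Completeness, Informativeness)
FactorVAE score
IRS (Interventional Robustness Score)

All reported with bootstrap confidence intervals.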

The project is still in its early stages, and contributions are welcome.

GitHub

https://github.com/nischal1234/reliably

Documentation

https://reliably.readthedocs.io

PyPI

pip install reliably-metrics

Machine learning has become incredibly good at reporting tiny metric improvements.

We're much worse at determining whether those improvements are actually real.

A model with:

AUROC = 0.851

isn't enough.

What you really need is:

AUROC = 0.851 [0.812, 0.887]

Because uncertainty isn't optional.

It's part of the measurement.

Let's make statistically rigorous ML evaluation the default—not the exception.


Read original on dev.to → dev.to/nischal_mandal_bc08e73405/stop-shipping-m…
