# Stop Shipping ML Models With Bare Floats: A Deep Dive Into Statistically Rigorous Model Evaluation

> Source: <https://dev.to/nischal_mandal_bc08e73405/stop-shipping-ml-models-with-bare-floats-a-deep-dive-into-statistically-rigorous-model-evaluation-394p>
> Published: 2026-06-15 19:44:49+00:00

Every week, somewhere, a team makes a deployment decision that looks like this:

```
Model A: AUROC = 0.847
Model B: AUROC = 0.851
```

They ship Model B.

Maybe it's better.

Maybe it's noise.

Nobody knows—because nobody computed a confidence interval.

That's exactly why I built **reliably-metrics**.

Most ML evaluation today looks like this:

```
print(f"AUROC = {auroc:.4f}")
```

Output:

```
AUROC = 0.8512
```

Looks precise.

Looks scientific.

But it tells you almost nothing about uncertainty.

Metrics are estimates computed from finite samples. Without uncertainty quantification, you're making decisions using a single point estimate and hoping it's representative.
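To make "estimate from a finite sample" concrete, here's a minimal sketch of a percentile bootstrap interval for AUROC, written with plain NumPy. It is not the `reliably-metrics` implementation; the function names `auroc` and `bootstrap_ci` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC as the probability that a random positive outranks a random negative."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score; ties count half.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        yt, ys = y_true[idx], y_score[idx]
        if yt.min() == yt.max():  # a resample with a single class has no AUROC
            continue
        stats.append(auroc(yt, ys))
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auroc(y_true, y_score), float(low), float(high)

# Synthetic data (assumed): 500 labels, scores shifted upward for positives.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
y_score = rng.normal(loc=y_true * 1.4, scale=1.0)

point, low, high = bootstrap_ci(y_true, y_score)
print(f"AUROC = {point:.3f} [{low:.3f}, {high:.3f}]")
```

With only 500 samples the interval is typically several hundredths wide, which is already larger than many of the differences teams ship on.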

Consider two models evaluated on 500 test samples:

```
Model A: AUROC = 0.847
Model B: AUROC = 0.851
Difference = +0.004
```

Is that improvement real?

Or would it disappear if you collected another batch of test data?

Most ML tooling doesn't answer that question.

`reliably-metrics`

```
pip install reliably-metrics
```

Basic evaluation:

``` python
import reliably as rb

report = rb.evaluate(y_true, y_prob)

print(report.summary())
```

Output:

```
Report(task=binary, n=500)
  ECE=0.0412 [0.0287, 0.0541]
  smECE=0.0389 [0.0261, 0.0523]
  Brier=0.1834 [0.1612, 0.2063]
  NLL=0.4821 [0.4503, 0.5148]
  AUROC=0.8234 [0.7941, 0.8509]
```

Notice something different?

Every metric comes with a 95% confidence interval.

No extra code.

No manual bootstrap implementation.

No statistics package required.

Instead of comparing raw metric values, compare uncertainty-aware estimates.

```
result = rb.compare(
    model_a,
    model_b,
    metric="auroc",
    y_true=y_true
)

print(f"Delta: {result.delta:+.4f}")
print(f"95% CI: [{result.ci.low:.4f}, {result.ci.high:.4f}]")
print(f"p-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")
```

Output:

```
Delta: +0.0182
95% CI: [-0.0031, 0.0396]
p-value: 0.0940
Significant: False
```

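Interpretation: the 95% confidence interval contains zero and the p-value is above 0.05, so the data cannot distinguish this improvement from sampling noise.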

Translation:

Don't deploy Model B yet.
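For intuition about what a paired comparison involves, here's a rough sketch of a paired bootstrap on the Brier score. The key idea is that both models are scored on the same resampled rows, so shared noise cancels out of the difference. The function name `paired_bootstrap_delta` and the synthetic predictions are assumptions; the library's own procedure may differ.

```python
import numpy as np

def paired_bootstrap_delta(y_true, prob_a, prob_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for Brier(B) - Brier(A), resampling the same rows for both models."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    # Per-example squared errors; the Brier score is their mean.
    err_a = (prob_a - y_true) ** 2
    err_b = (prob_b - y_true) ** 2
    idx = rng.integers(0, n, size=(n_boot, n))  # one row of indices per replicate
    deltas = err_b[idx].mean(axis=1) - err_a[idx].mean(axis=1)
    low, high = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    observed = float(err_b.mean() - err_a.mean())
    return observed, float(low), float(high)

# Synthetic predictions (assumed): two noisy models scored on the same 500 labels.
rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=500).astype(float)
signal = 2.0 * y_true - 1.0
prob_a = 1 / (1 + np.exp(-(signal + rng.normal(scale=1.5, size=500))))
prob_b = 1 / (1 + np.exp(-(signal + rng.normal(scale=1.4, size=500))))

delta, low, high = paired_bootstrap_delta(y_true, prob_a, prob_b)
significant = not (low <= 0.0 <= high)  # does the interval exclude zero?
print(f"Delta: {delta:+.4f}  95% CI: [{low:.4f}, {high:.4f}]  significant: {significant}")
```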

The library automatically selects the appropriate test:

| Metric | Statistical Method |
|---|---|
| AUROC | DeLong Test |
| Other Metrics | Paired Bootstrap |
| Multiple Comparisons | Holm–Bonferroni Correction |
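The Holm–Bonferroni step in the table is simple enough to sketch in a few lines of plain Python. This is a standalone illustration of the standard step-down procedure, not the library's code; `holm_bonferroni` is a name chosen here.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject

p_values = [0.010, 0.040, 0.030, 0.005]
print(holm_bonferroni(p_values))  # [True, False, False, True]
```

Here 0.030 and 0.040 would each pass an uncorrected 0.05 threshold, but neither survives once four comparisons are accounted for.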

A model can have excellent accuracy while being poorly calibrated.

If your model outputs:

```
predict_proba = 0.90
```

then across all predictions made with that confidence, about 90% should turn out to be correct.

In practice, many production systems are far from this ideal.
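The gap between stated confidence and observed frequency is what ECE summarizes. Below is a minimal sketch of the common equal-width binned estimator for binary problems; the bin count, the function name, and the synthetic "overconfident" model are all assumptions, and the library's ECE may be defined differently.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binned ECE for binary labels and predicted P(y=1), using equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1 (p == 1.0 lands in the last bin).
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        confidence = y_prob[mask].mean()  # mean predicted probability in the bin
        frequency = y_true[mask].mean()   # observed fraction of positives in the bin
        ece += mask.mean() * abs(confidence - frequency)  # weight by bin share
    return float(ece)

# Synthetic check (assumed data): calibrated probabilities vs. sharpened ones.
rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=5000)
y = (rng.uniform(size=5000) < p).astype(float)
logits = np.log(p / (1 - p))
p_overconfident = 1 / (1 + np.exp(-3.0 * logits))  # pushes predictions toward 0 and 1

ece_calibrated = expected_calibration_error(y, p)
ece_overconfident = expected_calibration_error(y, p_overconfident)
print(f"calibrated: {ece_calibrated:.4f}  overconfident: {ece_overconfident:.4f}")
```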

```
report_before = rb.evaluate(
    y_true,
    y_prob
)

print(report_before["ECE"])
```

Output:

```
ECE=0.0821 [0.0612, 0.1034]
```

``` python
cal = rb.recalibrate(
    y_true,
    y_prob,
    method="temperature"
)

y_prob_cal = cal.predict(y_prob_test)
report_after = rb.evaluate(
    y_true_test,
    y_prob_cal
)

print(report_after["ECE"])
```

Output:

```
ECE=0.0241 [0.0143, 0.0352]
```
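If temperature scaling is new to you, the sketch below shows the idea for the binary case: divide the logits by a single scalar chosen to minimize negative log-likelihood on held-out data. It uses a plain grid search as a stand-in optimizer, and every name in it (`nll`, `apply_temperature`, `fit_temperature`) is hypothetical rather than part of `reliably-metrics`.

```python
import numpy as np

def nll(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood for binary labels."""
    p = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def apply_temperature(y_prob, temperature, eps=1e-12):
    """Convert probabilities to logits, divide by the temperature, convert back."""
    p = np.clip(y_prob, eps, 1 - eps)
    logits = np.log(p / (1 - p))
    return 1.0 / (1.0 + np.exp(-logits / temperature))

def fit_temperature(y_true, y_prob):
    """Pick the temperature on a fixed grid that minimizes NLL (grid search is an assumption)."""
    grid = np.geomspace(0.1, 10.0, 200)
    losses = [nll(y_true, apply_temperature(y_prob, t)) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic overconfident model (assumed): true probabilities with logits multiplied by 3.
rng = np.random.default_rng(0)
true_p = rng.uniform(0.05, 0.95, size=4000)
y = (rng.uniform(size=4000) < true_p).astype(float)
overconfident = apply_temperature(true_p, 1 / 3)

cal, test = slice(0, 2000), slice(2000, 4000)  # fit on one half, evaluate on the other
T = fit_temperature(y[cal], overconfident[cal])
nll_before = nll(y[test], overconfident[test])
nll_after = nll(y[test], apply_temperature(overconfident[test], T))
print(f"T = {T:.2f}  NLL before = {nll_before:.4f}  after = {nll_after:.4f}")
```

Fitting on one split and evaluating on another matters: a temperature tuned and scored on the same data will look better than it is.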

Supported methods:

Most calibration plots show a line and leave interpretation to the reader.

`reliably-metrics` can visualize uncertainty directly.

``` python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 6))

report.reliability_diagram(
    y_true,
    y_prob,
    ax=ax,
    band=True
)

plt.savefig(
    "calibration.png",
    dpi=150
)
```

The shaded region represents a bootstrap confidence band around the calibration curve.

This helps distinguish real calibration errors from random fluctuations.
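One plausible way to build such a band, sketched here under assumptions, is to recompute the per-bin observed frequency on each bootstrap resample and take quantiles per bin. The function `calibration_band`, the bin count, and the data are illustrative; the library may construct its band differently.

```python
import numpy as np

def calibration_band(y_true, y_prob, n_bins=10, n_boot=500, alpha=0.05, seed=0):
    """Per-bin percentile band for the observed positive rate (illustrative)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # bin index 0..n_bins-1 per prediction
    curves = np.full((n_boot, n_bins), np.nan)
    for r in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        yb, bb = y_true[idx], bin_ids[idx]
        counts = np.bincount(bb, minlength=n_bins)
        positives = np.bincount(bb, weights=yb, minlength=n_bins)
        filled = counts > 0  # empty bins stay NaN for this replicate
        curves[r, filled] = positives[filled] / counts[filled]
    lower = np.nanquantile(curves, alpha / 2, axis=0)
    upper = np.nanquantile(curves, 1 - alpha / 2, axis=0)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, lower, upper

# Synthetic, well-calibrated predictions (assumed) to exercise the function.
rng = np.random.default_rng(3)
p = rng.uniform(0.0, 1.0, size=2000)
y = (rng.uniform(size=2000) < p).astype(float)

centers, lower, upper = calibration_band(y, p)
for c, lo, hi in zip(centers, lower, upper):
    print(f"bin center {c:.2f}: observed rate in [{lo:.3f}, {hi:.3f}]")
```

A bin whose band straddles the diagonal is consistent with good calibration; a bin whose band sits clearly off it is not.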

Need a report for teammates or stakeholders?

```
report.to_html(
    path="model_report.html"
)
```

That's it.

The generated report contains:

No Jupyter notebook required.

Core installation:

```
pip install reliably-metrics
```

Visualization support:

```
pip install "reliably-metrics[viz]"
```

HTML reporting:

```
pip install "reliably-metrics[report]"
```

Everything:

```
pip install "reliably-metrics[all]"
```

Heavy dependencies are loaded only when needed.
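A common pattern for this kind of deferred loading, shown here as a sketch rather than the library's actual mechanism, is to import the optional module inside the function that needs it and point to the right extra when it's missing. The helper name `require_optional` is hypothetical, and `json` stands in for a heavy dependency so the example runs anywhere.

```python
import importlib

def require_optional(module_name, extra):
    """Import a module on first use, or explain which extra provides it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is needed for this feature; "
            f'try: pip install "reliably-metrics[{extra}]"'
        ) from exc

def render_report(payload):
    # The import happens only when this function is called, not at package import time.
    json_mod = require_optional("json", extra="report")  # stand-in for a heavy dependency
    return json_mod.dumps(payload)

print(render_report({"ECE": 0.0412}))
```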

Traditional bootstrap implementations often look like this:

```
metrics = []
for i in range(10000):
    sample = resample(data)                 # draw n rows with replacement
    metrics.append(compute_metric(sample))  # one Python-level call per replicate
```

That means 10,000 iterations of a Python-level loop.

`reliably-metrics` instead generates all bootstrap indices up front and performs calculations using vectorized NumPy operations.
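Here's a small sketch of what "all indices up front" can look like for a metric that is a mean of per-example losses, such as the Brier score. It also checks that the vectorized result matches a plain loop over the same indices. The data and variable names are assumptions, and this is not taken from the library's source.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_boot = 500, 2000

# Synthetic labels and probabilities (assumed for the demo).
y_true = rng.integers(0, 2, size=n).astype(float)
y_prob = np.clip(0.5 + 0.3 * (y_true - 0.5) + rng.normal(scale=0.2, size=n), 0.0, 1.0)

sq_err = (y_prob - y_true) ** 2             # per-example Brier contributions
idx = rng.integers(0, n, size=(n_boot, n))  # every bootstrap index, drawn in one call

# Vectorized: fancy-index into an (n_boot, n) matrix and average each row.
vectorized = sq_err[idx].mean(axis=1)

# Equivalent loop over the same index rows, for comparison.
looped = np.array([sq_err[row].mean() for row in idx])

low, high = np.quantile(vectorized, [0.025, 0.975])
print(f"Brier = {sq_err.mean():.4f} [{low:.4f}, {high:.4f}]")
print("loop and vectorized agree:", bool(np.allclose(vectorized, looped)))
```

The trade-off is memory: the index matrix holds `n_boot * n` integers, so very large test sets may need to be processed in chunks. Metrics that don't decompose into per-example terms, such as AUROC, need more work to vectorize.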

The result:

Every stochastic operation accepts an explicit seed.

```
report = rb.evaluate(
    y_true,
    y_prob,
    seed=42
)
```

Same data.

Same seed.

Same output.

Always.
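In NumPy terms, this kind of reproducibility usually comes from building a private `Generator` from the seed instead of touching global random state. A minimal sketch, with a hypothetical `bootstrap_mean_ci` helper:

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=1000, seed=None):
    """95% percentile interval for the mean, driven by an explicit seed."""
    rng = np.random.default_rng(seed)  # private generator, no global state involved
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    low, high = np.quantile(means, [0.025, 0.975])
    return float(low), float(high)

values = np.random.default_rng(1).normal(size=200)

first = bootstrap_mean_ci(values, seed=42)
second = bootstrap_mean_ci(values, seed=42)
other = bootstrap_mean_ci(values, seed=43)

print(first == second)  # identical intervals for the same seed
print(first == other)   # a different seed gives a slightly different interval
```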

Many libraries claim statistical rigor.

We verify it.

The test suite repeatedly generates synthetic datasets with known ground-truth metrics and checks empirical confidence interval coverage.

If a nominal 95% confidence interval stops covering the true value approximately 95% of the time, the test suite fails.

Statistical correctness isn't just documentation—it's enforced in continuous integration.
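A coverage check of this kind can be sketched in a few lines: simulate many datasets from a known distribution, build an interval on each, and count how often the truth falls inside. The example below does this for the mean of Bernoulli draws; the function names, the sample sizes, and the loose tolerance are assumptions, not the project's actual tests.

```python
import numpy as np

def percentile_ci_for_mean(sample, rng, n_boot=500, alpha=0.05):
    """Percentile bootstrap interval for the mean of a 1-D sample."""
    idx = rng.integers(0, len(sample), size=(n_boot, len(sample)))
    means = sample[idx].mean(axis=1)
    low, high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)

def empirical_coverage(true_rate=0.3, n=400, n_trials=500, seed=0):
    """Fraction of simulated datasets whose interval contains the known true rate."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        sample = (rng.uniform(size=n) < true_rate).astype(float)  # known ground truth
        low, high = percentile_ci_for_mean(sample, rng)
        hits += int(low <= true_rate <= high)
    return hits / n_trials

coverage = empirical_coverage()
print(f"empirical coverage of nominal 95% interval: {coverage:.3f}")

# A CI-style assertion with a loose tolerance to absorb simulation noise.
assert 0.90 <= coverage <= 0.99
```

The tolerance has to be wide enough that simulation noise alone doesn't fail the build, yet tight enough to catch an interval that has quietly become too narrow.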

If you're working on:

the library also includes disentanglement evaluation metrics.

``` python
from reliably.repr import disentanglement

results = disentanglement(
    z,
    factors,
    metrics=(
        "mig",
        "sap",
        "dci",
        "factorvae",
        "irs"
    )
)

print(results["mig"])
```

Output:

```
MIG=0.312 [0.271, 0.354]
```

Included metrics: MIG, SAP, DCI, FactorVAE, and IRS.

All reported with bootstrap confidence intervals.

The project is still in its early stages, and contributions are welcome.

**GitHub**

[https://github.com/nischal1234/reliably](https://github.com/nischal1234/reliably)

**Documentation**

[https://reliably.readthedocs.io](https://reliably.readthedocs.io)

**PyPI**

```
pip install reliably-metrics
```

Machine learning has become incredibly good at reporting tiny metric improvements.

We're much worse at determining whether those improvements are actually real.

A model with:

```
AUROC = 0.851
```

isn't enough.

What you really need is:

```
AUROC = 0.851 [0.812, 0.887]
```

Because uncertainty isn't optional.

It's part of the measurement.

Let's make statistically rigorous ML evaluation the default—not the exception.
