Stop Shipping ML Models With Bare Floats: A Deep Dive Into Statistically Rigorous Model Evaluation

A developer built reliably-metrics, an open-source Python library that adds confidence intervals and statistical significance tests to common ML evaluation metrics like AUROC, ECE, and Brier score. The library automatically computes 95% confidence intervals and performs appropriate statistical tests (e.g., DeLong test for AUROC) to help teams make deployment decisions based on uncertainty-aware estimates rather than raw point estimates. It also supports calibration evaluation and recalibration, reliability diagrams with uncertainty bands, and generates HTML reports.

Every week, somewhere, a team makes a deployment decision that looks like this:

```
Model A: AUROC = 0.847
Model B: AUROC = 0.851
```

They ship Model B. Maybe it's better. Maybe it's noise. Nobody knows, because nobody computed a confidence interval. That's exactly why I built reliably-metrics.

Most ML evaluation today looks like this:

```python
print(f"AUROC = {auroc:.4f}")
```

Output:

```
AUROC = 0.8512
```

Looks precise. Looks scientific. But it tells you almost nothing about uncertainty. Metrics are estimates computed from finite samples. Without uncertainty quantification, you're making decisions using a single point estimate and hoping it's representative.

Consider two models evaluated on 500 test samples:

```
Model A: AUROC = 0.847
Model B: AUROC = 0.851
Difference = +0.004
```

Is that improvement real? Or would it disappear if you collected another batch of test data? Most ML tooling doesn't answer that question.

Install reliably-metrics:

```
pip install reliably-metrics
```

Basic evaluation:

```python
import reliably as rb

report = rb.evaluate(y_true, y_prob)
print(report.summary())
```

Output:

```
Report(task=binary, n=500)
  ECE=0.0412 [0.0287, 0.0541]
  smECE=0.0389 [0.0261, 0.0523]
  Brier=0.1834 [0.1612, 0.2063]
  NLL=0.4821 [0.4503, 0.5148]
  AUROC=0.8234 [0.7941, 0.8509]
```

Notice something different? Every metric comes with a 95% confidence interval. No extra code. No manual bootstrap implementation. No statistics package required.

Instead of comparing raw metric values, compare uncertainty-aware estimates.

```python
result = rb.compare(model_a, model_b, metric="auroc", y_true=y_true)

print(f"Delta: {result.delta:+.4f}")
print(f"95% CI: [{result.ci.low:.4f}, {result.ci.high:.4f}]")
print(f"p-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")
```

Output:

```
Delta: +0.0182
95% CI: [-0.0031, 0.0396]
p-value: 0.0940
Significant: False
```

Interpretation: the 95% confidence interval includes zero and the p-value is above 0.05, so the observed gain is not statistically significant at that level.

Translation: Don't deploy Model B yet.
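To make the idea of an uncertainty-aware comparison concrete, here is a minimal NumPy sketch of a paired percentile bootstrap for an AUROC difference. It is not the library's implementation (the post says reliably-metrics uses the DeLong test for AUROC); the function names `auroc` and `paired_bootstrap_delta`, the synthetic scores, and the default of 1,000 resamples are all assumptions made for illustration.

```python
import numpy as np


def auroc(y_true, scores):
    """AUROC via the Mann-Whitney U statistic; a tied pair counts as half a win."""
    pos = scores[y_true == 1]
    neg = np.sort(scores[y_true == 0])
    # For each positive score: negatives strictly below it, and negatives <= it.
    below = np.searchsorted(neg, pos, side="left")
    below_or_equal = np.searchsorted(neg, pos, side="right")
    return (below + below_or_equal).sum() / (2.0 * pos.size * neg.size)


def paired_bootstrap_delta(y_true, scores_a, scores_b,
                           n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for AUROC(B) - AUROC(A) on the same test rows."""
    rng = np.random.default_rng(seed)
    n = y_true.size
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        # "Paired": both models are scored on the same resampled rows.
        # Assumes every resample contains both classes (true for balanced data
        # of this size with overwhelming probability).
        idx = rng.integers(0, n, size=n)
        y = y_true[idx]
        deltas[b] = auroc(y, scores_b[idx]) - auroc(y, scores_a[idx])
    low, high = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    delta = auroc(y_true, scores_b) - auroc(y_true, scores_a)
    return delta, low, high


# Hypothetical data: two models whose scores share most of their noise.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
shared = rng.normal(size=500)
scores_a = y_true + shared + 0.50 * rng.normal(size=500)
scores_b = y_true + shared + 0.45 * rng.normal(size=500)

delta, low, high = paired_bootstrap_delta(y_true, scores_a, scores_b)
print(f"Delta: {delta:+.4f}  95% CI: [{low:.4f}, {high:.4f}]")
```

If the interval printed here straddles zero, the data cannot distinguish the two models, which is the same reading applied to the `rb.compare` output above.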
The library automatically selects the appropriate test:

| Metric | Statistical Method |
|---|---|
| AUROC | DeLong Test |
| Other Metrics | Paired Bootstrap |
| Multiple Comparisons | Holm–Bonferroni Correction |

A model can have excellent accuracy while being poorly calibrated. If your model outputs:

```
predict_proba = 0.90
```

it should be correct approximately 90% of the time. In practice, many production systems are far from this ideal.

```python
report_before = rb.evaluate(y_true, y_prob)
print(report_before["ECE"])
```

Output:

```
ECE=0.0821 [0.0612, 0.1034]
```

```python
cal = rb.recalibrate(y_true, y_prob, method="temperature")
y_prob_cal = cal.predict(y_prob_test)

report_after = rb.evaluate(y_true_test, y_prob_cal)
print(report_after["ECE"])
```

Output:

```
ECE=0.0241 [0.0143, 0.0352]
```

Supported methods include temperature scaling (`method="temperature"`).

Most calibration plots show a line and leave interpretation to the reader. reliably-metrics can visualize uncertainty directly.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 6))
report.reliability_diagram(y_true, y_prob, ax=ax, band=True)
plt.savefig("calibration.png", dpi=150)
```

The shaded region represents a bootstrap confidence band around the calibration curve. This helps distinguish real calibration errors from random fluctuations.

Need a report for teammates or stakeholders?

```python
report.to_html(path="model_report.html")
```

That's it. No Jupyter notebook required.

Core installation:

```
pip install reliably-metrics
```

Visualization support:

```
pip install "reliably-metrics[viz]"
```

HTML reporting:

```
pip install "reliably-metrics[report]"
```

Everything:

```
pip install "reliably-metrics[all]"
```

Heavy dependencies are loaded only when needed.

Traditional bootstrap implementations often look like this:

```python
for i in range(10000):
    sample = resample(data)
    metric = compute_metric(sample)
```

That means 10,000 iterations of a Python-level loop. reliably-metrics instead generates all bootstrap indices up front and performs calculations using vectorized NumPy operations.

Every stochastic operation accepts an explicit seed.
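The two ideas just described, drawing every bootstrap index up front and making the randomness reproducible through a seed, can be sketched for the Brier score as follows. This is an illustrative sketch rather than the library's code: the function name `bootstrap_brier_ci`, its defaults, and the synthetic data are assumptions.

```python
import numpy as np


def bootstrap_brier_ci(y_true, y_prob, n_boot=2000, alpha=0.05, seed=42):
    """Vectorized percentile-bootstrap CI for the Brier score."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    rng = np.random.default_rng(seed)  # explicit seed -> reproducible resamples
    n = y_true.size

    # The Brier score is a mean of per-sample squared errors, so compute them once.
    sq_err = (y_prob - y_true) ** 2

    # All bootstrap indices at once: one row per resample, no Python loop.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot = sq_err[idx].mean(axis=1)

    low, high = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(sq_err.mean()), float(low), float(high)


# Hypothetical, well-calibrated synthetic predictions.
data_rng = np.random.default_rng(0)
y_prob = data_rng.uniform(size=500)
y_true = (data_rng.uniform(size=500) < y_prob).astype(int)

first = bootstrap_brier_ci(y_true, y_prob, seed=42)
second = bootstrap_brier_ci(y_true, y_prob, seed=42)
print("Brier=%.4f [%.4f, %.4f]" % first)
print("Reproducible:", first == second)
```

The gather-then-average trick works here because the Brier score decomposes into per-sample terms. Metrics such as AUROC or ECE do not decompose this simply, so a vectorized version for them needs more care than this sketch shows.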
```python
report = rb.evaluate(y_true, y_prob, seed=42)
```

Same data. Same seed. Same output. Always.

Many libraries claim statistical rigor. We verify it. The test suite repeatedly generates synthetic datasets with known ground-truth metrics and checks empirical confidence interval coverage. If a nominal 95% confidence interval stops covering the true value approximately 95% of the time, the tests fail. Statistical correctness isn't just documentation; it's enforced in continuous integration.

If you're working on representation learning, the library also includes disentanglement evaluation metrics.

```python
from reliably.repr import disentanglement

results = disentanglement(z, factors, metrics=["mig", "sap", "dci", "factorvae", "irs"])
print(results["mig"])
```

Output:

```
MIG=0.312 [0.271, 0.354]
```

Included metrics: MIG, SAP, DCI, FactorVAE, and IRS. All are reported with bootstrap confidence intervals.

The project is still in its early stages, and contributions are welcome.

- GitHub: https://github.com/nischal1234/reliably
- Documentation: https://reliably.readthedocs.io
- PyPI: `pip install reliably-metrics`

Machine learning has become incredibly good at reporting tiny metric improvements. We're much worse at determining whether those improvements are actually real. A model with:

```
AUROC = 0.851
```

isn't enough. What you really need is:

```
AUROC = 0.851 [0.812, 0.887]
```

Because uncertainty isn't optional. It's part of the measurement.

Let's make statistically rigorous ML evaluation the default, not the exception.