{"slug": "stop-shipping-ml-models-with-bare-floats-a-deep-dive-into-statistically-rigorous", "title": "Stop Shipping ML Models With Bare Floats: A Deep Dive Into Statistically Rigorous Model Evaluation", "summary": "A developer built reliably-metrics, an open-source Python library that adds confidence intervals and statistical significance tests to common ML evaluation metrics like AUROC, ECE, and Brier score. The library automatically computes 95% confidence intervals and performs appropriate statistical tests (e.g., DeLong test for AUROC) to help teams make deployment decisions based on uncertainty-aware estimates rather than raw point estimates. It also supports calibration evaluation and recalibration, reliability diagrams with uncertainty bands, and generates HTML reports.", "body_md": "Every week, somewhere, a team makes a deployment decision that looks like this:\n\n```\nModel A: AUROC = 0.847\nModel B: AUROC = 0.851\n```\n\nThey ship Model B.\n\nMaybe it's better.\n\nMaybe it's noise.\n\nNobody knows—because nobody computed a confidence interval.\n\nThat's exactly why I built **reliably-metrics**.\n\nMost ML evaluation today looks like this:\n\n```\nprint(f\"AUROC = {auroc:.4f}\")\n```\n\nOutput:\n\n```\nAUROC = 0.8512\n```\n\nLooks precise.\n\nLooks scientific.\n\nBut it tells you almost nothing about uncertainty.\n\nMetrics are estimates computed from finite samples. 
Without uncertainty quantification, you're making decisions using a single point estimate and hoping it's representative.\n\nConsider two models evaluated on 500 test samples:\n\n```\nModel A: AUROC = 0.847\nModel B: AUROC = 0.851\nDifference = +0.004\n```\n\nIs that improvement real?\n\nOr would it disappear if you collected another batch of test data?\n\nMost ML tooling doesn't answer that question.\n\n`reliably-metrics`\n\n```\npip install reliably-metrics\n```\n\nBasic evaluation:\n\n``` python\nimport reliably as rb\n\nreport = rb.evaluate(y_true, y_prob)\n\nprint(report.summary())\n```\n\nOutput:\n\n```\nReport(task=binary, n=500)\n  ECE=0.0412 [0.0287, 0.0541]\n  smECE=0.0389 [0.0261, 0.0523]\n  Brier=0.1834 [0.1612, 0.2063]\n  NLL=0.4821 [0.4503, 0.5148]\n  AUROC=0.8234 [0.7941, 0.8509]\n```\n\nNotice something different?\n\nEvery metric comes with a 95% confidence interval.\n\nNo extra code.\n\nNo manual bootstrap implementation.\n\nNo statistics package required.\n\nInstead of comparing raw metric values, compare uncertainty-aware estimates.\n\n```\nresult = rb.compare(\n    model_a,\n    model_b,\n    metric=\"auroc\",\n    y_true=y_true\n)\n\nprint(f\"Delta: {result.delta:+.4f}\")\nprint(f\"95% CI: [{result.ci.low:.4f}, {result.ci.high:.4f}]\")\nprint(f\"p-value: {result.p_value:.4f}\")\nprint(f\"Significant: {result.significant}\")\n```\n\nOutput:\n\n```\nDelta: +0.0182\n95% CI: [-0.0031, 0.0396]\np-value: 0.094\nSignificant: False\n```\n\nInterpretation:\n\nTranslation:\n\nDon't deploy Model B yet.\n\nThe library automatically selects the appropriate test:\n\n| Metric | Statistical Method |\n|---|---|\n| AUROC | DeLong Test |\n| Other Metrics | Paired Bootstrap |\n| Multiple Comparisons | Holm–Bonferroni Correction |\n\nA model can have excellent accuracy while being poorly calibrated.\n\nIf your model outputs:\n\n```\npredict_proba = 0.90\n```\n\nit should be correct approximately 90% of the time.\n\nIn practice, many production systems are far from 
this ideal.\n\n```\nreport_before = rb.evaluate(\n    y_true,\n    y_prob\n)\n\nprint(report_before[\"ECE\"])\n```\n\nOutput:\n\n```\nECE=0.0821 [0.0612, 0.1034]\n```\n\n```\ncal = rb.recalibrate(\n    y_true,\n    y_prob,\n    method=\"temperature\"\n)\n\ny_prob_cal = cal.predict(y_prob_test)\nreport_after = rb.evaluate(\n    y_true_test,\n    y_prob_cal\n)\n\nprint(report_after[\"ECE\"])\n```\n\nOutput:\n\n```\nECE=0.0241 [0.0143, 0.0352]\n```\n\nSupported methods:\n\nMost calibration plots show a line and leave interpretation to the reader.\n\n`reliably-metrics` can visualize uncertainty directly.\n\n``` python\nimport matplotlib.pyplot as plt\n\nfig, ax = plt.subplots(figsize=(6, 6))\n\nreport.reliability_diagram(\n    y_true,\n    y_prob,\n    ax=ax,\n    band=True\n)\n\nplt.savefig(\n    \"calibration.png\",\n    dpi=150\n)\n```\n\nThe shaded region represents a bootstrap confidence band around the calibration curve.\n\nThis helps distinguish real calibration errors from random fluctuations.\n\nNeed a report for teammates or stakeholders?\n\n```\nreport.to_html(\n    path=\"model_report.html\"\n)\n```\n\nThat's it.\n\nThe generated report contains:\n\nNo Jupyter notebook required.\n\nCore installation:\n\n```\npip install reliably-metrics\n```\n\nVisualization support:\n\n```\npip install \"reliably-metrics[viz]\"\n```\n\nHTML reporting:\n\n```\npip install \"reliably-metrics[report]\"\n```\n\nEverything:\n\n```\npip install \"reliably-metrics[all]\"\n```\n\nHeavy dependencies are loaded only when needed.\n\nTraditional bootstrap implementations often look like this:\n\n```\nfor i in range(10000):\n    sample = resample(data)\n    metric = compute_metric(sample)\n```\n\nThat means 10,000 iterations of an interpreted Python loop.\n\n`reliably-metrics` instead generates all bootstrap indices up front and performs calculations using vectorized NumPy operations.\n\nThe result:\n\nEvery stochastic operation accepts an explicit seed.\n\n```\nreport = rb.evaluate(\n    y_true,\n    y_prob,\n    
seed=42\n)\n```\n\nSame data.\n\nSame seed.\n\nSame output.\n\nAlways.\n\nMany libraries claim statistical rigor.\n\nWe verify it.\n\nThe test suite repeatedly generates synthetic datasets with known ground-truth metrics and checks empirical confidence interval coverage.\n\nIf a nominal 95% confidence interval stops covering the true value approximately 95% of the time, the coverage tests fail.\n\nStatistical correctness isn't just documentation; it's enforced in continuous integration.\n\nIf you're working on:\n\nthe library also includes disentanglement evaluation metrics.\n\n``` python\nfrom reliably.repr import disentanglement\n\nresults = disentanglement(\n    z,\n    factors,\n    metrics=(\n        \"mig\",\n        \"sap\",\n        \"dci\",\n        \"factorvae\",\n        \"irs\"\n    )\n)\n\nprint(results[\"mig\"])\n```\n\nOutput:\n\n```\nMIG=0.312 [0.271, 0.354]\n```\n\nIncluded metrics:\n\nAll are reported with bootstrap confidence intervals.\n\nThe project is still in its early stages, and contributions are welcome.\n\n**GitHub**\n\n[https://github.com/nischal1234/reliably](https://github.com/nischal1234/reliably)\n\n**Documentation**\n\n[https://reliably.readthedocs.io](https://reliably.readthedocs.io)\n\n**PyPI**\n\n```\npip install reliably-metrics\n```\n\nMachine learning has become incredibly good at reporting tiny metric improvements.\n\nWe're much worse at determining whether those improvements are actually real.\n\nA model with:\n\n```\nAUROC = 0.851\n```\n\nisn't enough.\n\nWhat you really need is:\n\n```\nAUROC = 0.851 [0.812, 0.887]\n```\n\nBecause uncertainty isn't optional.\n\nIt's part of the measurement.\n\nLet's make statistically rigorous ML evaluation the default, not the exception.", "url": "https://wpnews.pro/news/stop-shipping-ml-models-with-bare-floats-a-deep-dive-into-statistically-rigorous", "canonical_source": 
"https://dev.to/nischal_mandal_bc08e73405/stop-shipping-ml-models-with-bare-floats-a-deep-dive-into-statistically-rigorous-model-evaluation-394p", "published_at": "2026-06-15 19:44:49+00:00", "updated_at": "2026-06-15 20:02:38.461141+00:00", "lang": "en", "topics": ["machine-learning", "developer-tools", "mlops"], "entities": ["reliably-metrics", "DeLong Test", "Holm–Bonferroni Correction"], "alternates": {"html": "https://wpnews.pro/news/stop-shipping-ml-models-with-bare-floats-a-deep-dive-into-statistically-rigorous", "markdown": "https://wpnews.pro/news/stop-shipping-ml-models-with-bare-floats-a-deep-dive-into-statistically-rigorous.md", "text": "https://wpnews.pro/news/stop-shipping-ml-models-with-bare-floats-a-deep-dive-into-statistically-rigorous.txt", "jsonld": "https://wpnews.pro/news/stop-shipping-ml-models-with-bare-floats-a-deep-dive-into-statistically-rigorous.jsonld"}}