Every week, somewhere, a team makes a deployment decision that looks like this:
Model A: AUROC = 0.847
Model B: AUROC = 0.851
They ship Model B.
Maybe it's better.
Maybe it's noise.
Nobody knows—because nobody computed a confidence interval.
That's exactly why I built reliably-metrics.
Most ML evaluation today looks like this:
print(f"AUROC = {auroc:.4f}")
Output:
AUROC = 0.8512
Looks precise.
Looks scientific.
But it tells you almost nothing about uncertainty.
Metrics are estimates computed from finite samples. Without uncertainty quantification, you're making decisions using a single point estimate and hoping it's representative.
Consider two models evaluated on 500 test samples:
Model A: AUROC = 0.847
Model B: AUROC = 0.851
Difference = +0.004
Is that improvement real?
Or would it disappear if you collected another batch of test data?
Most ML tooling doesn't answer that question.
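A back-of-the-envelope check already hints at the answer. The sketch below uses the Hanley-McNeil (1982) approximation for the standard error of a single AUROC estimate. The 250/250 class split is an assumption made purely for illustration (the example only says there are 500 test samples), and the function name is hypothetical.

```python
import math

def hanley_mcneil_se(auc: float, n_pos: int, n_neg: int) -> float:
    """Approximate standard error of an AUROC estimate (Hanley & McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc**2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc**2)
        + (n_neg - 1) * (q2 - auc**2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

# Assumption: the 500 test samples split evenly into positives and negatives.
se_b = hanley_mcneil_se(0.851, n_pos=250, n_neg=250)
half_width = 1.96 * se_b  # normal-approximation 95% half-width
print(f"SE = {se_b:.4f}, 95% half-width = {half_width:.4f}")
```

Under that assumption the standard error is roughly 0.017, several times larger than the 0.004 gap. Both models are scored on the same test set, so their estimates are correlated and a paired test is still needed for a proper answer; this only shows the scale of the noise.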
reliably-metrics
pip install reliably-metrics
Basic evaluation:
import reliably as rb
report = rb.evaluate(y_true, y_prob)
print(report.summary())
Output:
Report(task=binary, n=500)
ECE=0.0412 [0.0287, 0.0541]
smECE=0.0389 [0.0261, 0.0523]
Brier=0.1834 [0.1612, 0.2063]
NLL=0.4821 [0.4503, 0.5148]
AUROC=0.8234 [0.7941, 0.8509]
Notice something different?
Every metric comes with a 95% confidence interval.
No extra code.
No manual bootstrap implementation.
No statistics package required.
Instead of comparing raw metric values, compare uncertainty-aware estimates.
result = rb.compare(
    model_a,
    model_b,
    metric="auroc",
    y_true=y_true,
)
print(f"Delta: {result.delta:+.4f}")
print(f"95% CI: [{result.ci.low:.4f}, {result.ci.high:.4f}]")
print(f"p-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")
Output:
Delta: +0.0182
95% CI: [-0.0031, 0.0396]
p-value: 0.0940
Significant: False
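Interpretation: the 95% confidence interval for the difference includes zero and the p-value is above 0.05, so the observed gain is not statistically significant at the 5% level.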
Translation:
Don't deploy Model B yet.
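For intuition, here is a minimal sketch of a paired bootstrap comparison. It is not the library's implementation: the function names, the synthetic data, and the choice of the Brier score (picked because it is a simple per-sample mean) are all assumptions for illustration. The key idea is that every resample draws the same rows for both models, so the per-example correlation between them is preserved.

```python
import numpy as np

def brier(y_true, y_prob):
    # Mean squared error between predicted probability and the 0/1 label.
    # axis=-1 lets the same function score one dataset or a stack of resamples.
    return np.mean((y_prob - y_true) ** 2, axis=-1)

def paired_bootstrap_delta(y_true, prob_a, prob_b, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    # One index matrix, reused for both models: this is what makes it "paired".
    idx = rng.integers(0, n, size=(n_boot, n))
    deltas = brier(y_true[idx], prob_b[idx]) - brier(y_true[idx], prob_a[idx])
    low, high = np.percentile(deltas, [2.5, 97.5])
    observed = brier(y_true, prob_b) - brier(y_true, prob_a)
    return float(observed), float(low), float(high)

# Synthetic stand-in data (hypothetical): two noisy models on 500 shared labels.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500).astype(float)
signal = 2.0 * y_true - 1.0
prob_a = np.clip(0.5 + 0.25 * signal + rng.normal(0, 0.2, 500), 0.01, 0.99)
prob_b = np.clip(0.5 + 0.27 * signal + rng.normal(0, 0.2, 500), 0.01, 0.99)

delta, low, high = paired_bootstrap_delta(y_true, prob_a, prob_b)
print(f"Delta Brier (B - A): {delta:+.4f}  95% CI: [{low:+.4f}, {high:+.4f}]")
print("CI excludes zero:", not (low <= 0.0 <= high))
```

If the interval straddles zero, the data cannot distinguish the two models at the 5% level.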
The library automatically selects the appropriate test:
| Scenario | Statistical Method |
|---|---|
| AUROC | DeLong Test |
| Other Metrics | Paired Bootstrap |
| Multiple Comparisons | Holm–Bonferroni Correction |
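The Holm-Bonferroni step matters when several models or metrics are compared at once. As a minimal sketch (the function name is hypothetical and this is not taken from the library's source), the procedure sorts the p-values, compares the smallest against alpha / m, the next against alpha / (m - 1), and so on, stopping at the first one that fails:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The threshold loosens as we move down the sorted list.
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

print(holm_bonferroni([0.010, 0.040, 0.030, 0.005]))
# → [True, False, False, True]
```

Note that 0.030 and 0.040 would each pass an uncorrected 0.05 threshold, but not after the correction.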
A model can have excellent accuracy while being poorly calibrated.
If your model outputs:
predict_proba = 0.90
then, across all predictions made with that probability, it should be correct about 90% of the time.
In practice, many production systems are far from this ideal.
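One way to measure that gap is to group predictions into probability bins and compare each bin's average predicted probability with its observed positive rate. Below is a minimal sketch of a binned Expected Calibration Error under that definition. Equal-width bins and the function name are assumptions for illustration; reliably-metrics may define ECE differently.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin ids in 0..n_bins-1.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between observed positive rate and mean predicted probability,
            # weighted by the fraction of samples that fall in this bin.
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# Ten predictions of 0.90: calibrated if 9 are positive, overconfident if only 5 are.
probs = [0.9] * 10
calibrated = expected_calibration_error([1] * 9 + [0], probs)
overconfident = expected_calibration_error([1] * 5 + [0] * 5, probs)
print(f"calibrated: {calibrated:.3f}, overconfident: {overconfident:.3f}")
```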
report_before = rb.evaluate(y_true, y_prob)
print(report_before["ECE"])
Output:
ECE=0.0821 [0.0612, 0.1034]
cal = rb.recalibrate(y_true, y_prob, method="temperature")
y_prob_cal = cal.predict(y_prob_test)
report_after = rb.evaluate(y_true_test, y_prob_cal)
print(report_after["ECE"])
Output:
ECE=0.0241 [0.0143, 0.0352]
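For readers unfamiliar with temperature scaling, here is a rough sketch of the idea for binary probabilities: convert each probability to a logit, divide by a single scalar T, and pick the T that minimizes negative log-likelihood on a calibration set. The grid search, the helper names, and the synthetic data are assumptions for illustration, not a description of what `rb.recalibrate` does internally.

```python
import numpy as np

def nll(y_true, y_prob, eps=1e-12):
    p = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def apply_temperature(y_prob, temperature, eps=1e-12):
    p = np.clip(y_prob, eps, 1 - eps)
    logits = np.log(p) - np.log1p(-p)  # inverse sigmoid
    return 1.0 / (1.0 + np.exp(-logits / temperature))

def fit_temperature(y_true, y_prob, grid=None):
    # Simple grid search; a real implementation would likely use an optimizer.
    if grid is None:
        grid = np.linspace(0.25, 5.0, 96)
    losses = [nll(y_true, apply_temperature(y_prob, t)) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic, deliberately overconfident model: its logits are twice the true ones.
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.5, size=2000)
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-z))).astype(float)
overconfident = 1.0 / (1.0 + np.exp(-2.0 * z))

t = fit_temperature(y, overconfident)
calibrated = apply_temperature(overconfident, t)
print(f"fitted T = {t:.2f}")
print(f"NLL before = {nll(y, overconfident):.4f}, after = {nll(y, calibrated):.4f}")
```

Because the synthetic logits were inflated by a factor of two, the fitted temperature should land near 2. In practice T should be fitted on held-out data and evaluated on a separate test split.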
Supported methods:
Most calibration plots show a line and leave interpretation to the reader.
reliably-metrics can visualize uncertainty directly.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 6))
report.reliability_diagram(y_true, y_prob, ax=ax, band=True)
plt.savefig("calibration.png", dpi=150)
The shaded region represents a bootstrap confidence band around the calibration curve.
This helps distinguish real calibration errors from random fluctuations.
Need a report for teammates or stakeholders?
report.to_html(path="model_report.html")
That's it.
The generated report contains:
No Jupyter notebook required.
Core installation:
pip install reliably-metrics
Visualization support:
pip install "reliably-metrics[viz]"
HTML reporting:
pip install "reliably-metrics[report]"
Everything:
pip install "reliably-metrics[all]"
Heavy dependencies are loaded only when needed.
Traditional bootstrap implementations often look like this:
for i in range(10000):
    sample = resample(data)
    metric = compute_metric(sample)
That means 10,000 iterations of an interpreted Python loop.
reliably-metrics instead generates all bootstrap indices up front and performs calculations using vectorized NumPy operations.
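As a minimal sketch of that idea (not the library's actual code; the function name and synthetic data are hypothetical), the example below draws one `(n_boot, n)` index matrix and computes every bootstrap replicate of the Brier score in a single NumPy expression. This works directly for metrics that are means of per-sample losses; rank-based metrics such as AUROC need more work.

```python
import numpy as np

def bootstrap_brier_ci(y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    sq_err = (y_prob - y_true) ** 2             # per-sample loss, computed once
    idx = rng.integers(0, n, size=(n_boot, n))  # all resample indices up front
    boot = sq_err[idx].mean(axis=1)             # one Brier score per resample
    low, high = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(sq_err.mean()), float(low), float(high)

# Synthetic stand-in data.
rng = np.random.default_rng(123)
y_true = rng.integers(0, 2, size=500).astype(float)
noise = rng.normal(0, 0.2, 500)
y_prob = np.clip(0.5 + 0.3 * (2 * y_true - 1) + noise, 0.01, 0.99)

first = bootstrap_brier_ci(y_true, y_prob, seed=42)
second = bootstrap_brier_ci(y_true, y_prob, seed=42)
print("Brier=%.4f [%.4f, %.4f]" % first)
print("Identical with the same seed:", first == second)
```

The index matrix costs `n_boot * n` integers of memory, so for large test sets a sketch like this would need to process resamples in chunks.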
The result:
Every stochastic operation accepts an explicit seed.
report = rb.evaluate(y_true, y_prob, seed=42)
Same data.
Same seed.
Same output.
Always.
Many libraries claim statistical rigor.
We verify it.
The test suite repeatedly generates synthetic datasets with known ground-truth metrics and checks empirical confidence interval coverage.
If a nominal 95% confidence interval stops covering the true value approximately 95% of the time, the test suite fails.
Statistical correctness isn't just documentation—it's enforced in continuous integration.
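To illustrate what such a coverage check might look like, here is a sketch under stated assumptions: it simulates a classifier with a known true accuracy, builds a percentile bootstrap interval for each simulated test set, and counts how often the interval contains the truth. The helper names, sample sizes, and the tolerance band are all hypothetical choices, not the project's actual test code.

```python
import numpy as np

def percentile_ci(sample, rng, n_boot=500, alpha=0.05):
    n = len(sample)
    idx = rng.integers(0, n, size=(n_boot, n))
    boot = sample[idx].mean(axis=1)
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])

def empirical_coverage(true_acc=0.8, n=200, n_trials=1000, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        # Each entry is 1.0 if the simulated prediction was correct, else 0.0.
        correct = (rng.random(n) < true_acc).astype(float)
        low, high = percentile_ci(correct, rng)
        hits += int(low <= true_acc <= high)
    return hits / n_trials

coverage = empirical_coverage()
print(f"Empirical coverage of nominal 95% CI: {coverage:.3f}")
# A CI job could fail when coverage drifts outside a tolerance band.
assert 0.90 <= coverage <= 0.99, "confidence intervals are miscalibrated"
```

The band is deliberately wider than a single point at 0.95 because the coverage estimate is itself noisy with a finite number of trials.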
If you're working on:
the library also includes disentanglement evaluation metrics.
from reliably.repr import disentanglement
results = disentanglement(
    z,
    factors,
    metrics=("mig", "sap", "dci", "factorvae", "irs"),
)
print(results["mig"])
Output:
MIG=0.312 [0.271, 0.354]
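Included metrics: MIG (Mutual Information Gap), SAP (Separated Attribute Predictability), DCI (Disentanglement, Completeness, Informativeness), the FactorVAE score, and IRS (Interventional Robustness Score).
All are reported with bootstrap confidence intervals.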
The project is still in its early stages, and contributions are welcome.
GitHub
https://github.com/nischal1234/reliably
Documentation
https://reliably.readthedocs.io
PyPI
pip install reliably-metrics
Machine learning has become incredibly good at reporting tiny metric improvements.
We're much worse at determining whether those improvements are actually real.
A model with:
AUROC = 0.851
isn't enough.
What you really need is:
AUROC = 0.851 [0.812, 0.887]
Because uncertainty isn't optional.
It's part of the measurement.
Let's make statistically rigorous ML evaluation the default—not the exception.