Bootstrap confidence intervals for your LLM eval metrics

Nexus Labs' fine-tuning and evaluation team lead demonstrated that a single evaluation metric like 84.2% accuracy on a 500-example set carries significant uncertainty, with a 95% bootstrap confidence interval spanning roughly 3 points on each side. Using bootstrap resampling, the team found that a 1.5-point gap between two model checkpoints fell within the noise, preventing a premature promotion to staging. The lead advocates for reporting confidence intervals and using paired bootstrap tests when comparing models, noting that interval width shrinks only with the square root of sample size, so halving it takes four times the data.

TL;DR: A single eval number hides its own uncertainty. Eval confidence intervals from bootstrap resampling turn a point estimate like 84.2% accuracy into a range, so you stop shipping models on a difference that is noise.

Two checkpoints came back from a fine-tuning run at 84.2% and 85.7% on our 500-example agent eval set. The 1.5-point gap read like a win, and someone wanted to promote the second checkpoint to staging. Before that, I wanted eval confidence intervals on both numbers, because a 500-example set carries more sampling error than most teams admit. At 500 examples, the 95% interval on a single accuracy near 85% spans roughly 3 points on each side. The win sat well inside the noise.

I lead the fine-tuning and evaluation team at Nexus Labs, and the most common mistake I see is treating an eval score as exact. It isn't. Your eval set is a sample drawn from the input space you care about, and a different 500 examples would return a different number. Confidence intervals make that variance visible.

An eval confidence interval is a range around a metric, like accuracy or F1, that quantifies how much the score would move if you resampled the eval set. A 95% bootstrap interval of [81.0%, 87.1%] means that across thousands of resamples of your data, 95% of the recomputed scores fell in that band. It measures sampling noise, not model quality.

That distinction matters. Two checkpoints scoring 84.2% and 85.7% with overlapping intervals are, as far as your eval set can tell, indistinguishable. Card et al. showed in "With Little Power Comes Great Responsibility" (https://aclanthology.org/2020.emnlp-main.745/) that many NLP experiments are underpowered to detect the effect sizes they report.

The bootstrap is resampling with replacement. You take your per-example results, draw N of them with replacement many times, recompute the metric each time, and read percentiles off the resulting distribution. There's no assumption that the metric is normally distributed.
```python
import numpy as np

# per-example correctness, 1 = pass, 0 = fail
# eval_pass_flags stands for your own list of per-example outcomes
results = np.array(eval_pass_flags)  # shape (500,)

def bootstrap_ci(x, n_boot=10_000, alpha=0.05):
    n = len(x)
    rng = np.random.default_rng(0)  # fixed seed so the interval is reproducible
    means = np.empty(n_boot)
    for i in range(n_boot):
        # draw n example indices with replacement and recompute the metric
        sample = x[rng.integers(0, n, n)]
        means[i] = sample.mean()
    lo = np.percentile(means, 100 * alpha / 2)
    hi = np.percentile(means, 100 * (1 - alpha / 2))
    return x.mean(), lo, hi

print(bootstrap_ci(results))
# (0.842, 0.806, 0.876)
```

SciPy ships `scipy.stats.bootstrap` if you'd rather not hand-roll it. For 500 examples and 10,000 resamples this runs in under a second, so there's no cost excuse to skip it.

When comparing two checkpoints, don't bootstrap each interval separately and check for overlap. Overlapping intervals can still hide a real difference. Use a paired bootstrap: resample the example indices once per iteration, score both models on the same indices, and record the difference.

```python
def paired_bootstrap(a, b, n_boot=10_000):
    # a and b are per-example pass flags for the two models, in the same example order
    n = len(a)
    rng = np.random.default_rng(0)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # one set of indices per iteration, shared by both models
        idx = rng.integers(0, n, n)
        diffs[i] = a[idx].mean() - b[idx].mean()  # positive values favor a
    return np.percentile(diffs, [2.5, 97.5])
```

If that interval on the difference contains zero, you can't claim the second checkpoint is better. On our 1.5-point gap it ran from -1.9 to +4.8 points. Zero is in the band, so we did not promote. Dror et al.'s "Hitchhiker's Guide to Testing Statistical Significance in NLP" (https://aclanthology.org/P18-1128/) covers when paired tests apply and which to pick.

Interval width shrinks with the square root of N, so halving it costs four times the labeled data. At 500 examples a near-85% metric carries about plus or minus 3 points; reaching plus or minus 1.5 needs roughly 2,000 labeled examples. That is the real budgeting question for an eval set, and it's why I push for fewer, higher-quality, well-stratified examples instead of chasing a round number.

For rare failure modes the picture is worse.
A category with 20 examples in your set has an interval so wide it tells you almost nothing, which is how aggregate scores stay stable while a subpopulation quietly regresses.

The bootstrap assumes your eval examples are independent and drawn from the distribution you care about. If they cluster (multiple turns from one conversation, or near-duplicate prompts), the effective sample size is smaller than N and your interval comes out too narrow. Dedup first.

It also only measures sampling noise. It says nothing about label error, distribution shift between your eval set and production traffic, or a judge model that's miscalibrated. A tight interval on a biased metric is still wrong, only now you're confident in it.

For very low pass rates the percentile bootstrap can misbehave; bias-corrected and accelerated (BCa) intervals are better there but slower to compute.

Eval confidence intervals are the cheapest reliability upgrade available to an ML team. A dozen lines of NumPy turn every score into a score plus a band, and the band is usually wider than the gap you were about to ship on. Next time a checkpoint wins by a point or two, run the paired bootstrap before you tell anyone. The honest answer is often "we can't tell yet, label more data."
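
To put a rough number on "label more data", the square-root scaling can be sketched with the textbook normal approximation for a proportion. This is a back-of-the-envelope stand-in, not the bootstrap itself, and it gets unreliable for small categories or pass rates near 0% or 100%. The function names `accuracy_half_width` and `examples_needed` are made up for this sketch.

```python
import math

def accuracy_half_width(p, n, z=1.96):
    """Approximate 95% half-width for an accuracy p measured on n examples.

    Uses the normal approximation z * sqrt(p * (1 - p) / n), which is an
    assumption of this sketch and not the bootstrap procedure itself.
    """
    return z * math.sqrt(p * (1 - p) / n)

def examples_needed(p, half_width, z=1.96):
    """Smallest n whose approximate half-width is at most half_width."""
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)

# Half-width in points for a near-85% metric at a few eval-set sizes.
for n in (20, 500, 2000):
    print(n, round(100 * accuracy_half_width(0.85, n), 1))

# Labeled examples needed for about plus or minus 1.5 points at 85% accuracy.
print(examples_needed(0.85, 0.015))
```

Under this approximation a near-85% score carries about 3.1 points either side at 500 examples and about 1.6 at 2,000, while a 20-example category sits near 15 points either side. Treat the output as a planning estimate and confirm it with the bootstrap once the labels exist.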