[ARTICLE · art-17134] src=arxiv.org pub= topic=machine-learning verified=true sentiment=· neutral

Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

A paired minimum detectable effect (MDE) budget for 4-bit quantization benchmarks, adapted from the classical paired-binary sample-size calculation, gives benchmark designers a one-line formula for pre-registering reliability claims before running experiments. In an unpaired pilot audit of four models and four benchmarks (five splits of 100 items each), every observed FP16-NF4 difference fell below the MDE implied by an assumed disagreement rate of 0.10, and most cross-split standard deviations were within 1.5 percentage points of the binomial sampling reference. This suggests that much of the "benchmark unreliability" reported on 100-item subsamples is binomial variance. A parallel prompt-template study on MMLU found template ranges of 2-10 percentage points, which meet or exceed the largest observed quantization delta (3.2 percentage points), so a quantization audit that does not first fix the prompt template absorbs template variance into its noise floor.

1 min read · published May 29, 2026
arXiv:2605.28873v1 Announce Type: new
Abstract: This is a planning-method note with an unpaired pilot audit. We adapt the classical paired-binary sample-size calculation (Miettinen, 1968) to quantization benchmarks, giving a conservative minimum detectable effect (MDE) bound $\delta^{*} \le (z_{1-\alpha/2}+z_{1-\beta})\sqrt{\rho_d/m}$ in the paired item count $m$ and the FP16-NF4 disagreement rate $\rho_d$. The bound turns "how reliable is my quantization claim?" into a one-line budget a benchmark designer can commit to before running. We illustrate the bound on four models and four benchmarks ($k=5$ splits of $n=100$), and add a parallel MMLU prompt-template study to put the bound's quantization-noise scale alongside the prompt-noise scale. Assuming $\rho_d=0.10$ (an unmeasured planning value), all observed NF4-FP16 deltas fall below the implied MDE, and most cross-split SDs lie within $\pm 1.5$ pp of the binomial reference $\sqrt{p(1-p)/n}$, so much of the variance reported as "benchmark unreliability" on $n=100$ subsamples is binomial sampling noise. The single borderline cell (OPT-WinoGrande, $|\Delta|=3.2$ pp) is below the implied MDE at $\rho_d=0.10$ but above it at $\rho_d=0.05$, illustrating the planning trade-off the bound makes explicit. On MMLU, prompt-template ranges of 2-10 pp meet or exceed the largest observed quantization delta (3.2 pp), so a quantization audit that does not first fix the prompt template absorbs template variance into its noise floor. We complement the bound with a five-line pre-registration template.
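
As a rough illustration of how the bound might be used for planning, the sketch below evaluates $(z_{1-\alpha/2}+z_{1-\beta})\sqrt{\rho_d/m}$ and the binomial reference $\sqrt{p(1-p)/n}$ from the abstract. The function names are hypothetical, and the defaults ($\alpha=0.05$, power $=0.80$) and the pooled item count $m=500$ (five splits of 100) are assumptions; the abstract does not state them.

```python
import math
from statistics import NormalDist


def paired_mde(m, rho_d, alpha=0.05, power=0.80):
    """Conservative paired MDE: (z_{1-alpha/2} + z_{1-beta}) * sqrt(rho_d / m).

    m      : number of paired items scored under both FP16 and NF4
    rho_d  : assumed FP16-NF4 disagreement rate (a planning value)
    alpha, power : ASSUMED defaults, not taken from the abstract
    """
    if m <= 0 or not 0.0 <= rho_d <= 1.0:
        raise ValueError("need m > 0 and 0 <= rho_d <= 1")
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    return z_sum * math.sqrt(rho_d / m)


def required_pairs(delta, rho_d, alpha=0.05, power=0.80):
    """Invert the bound: smallest m whose MDE is at most delta."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    return math.ceil(rho_d * (z_sum / delta) ** 2)


def binomial_sd(p, n):
    """Binomial reference SD of an accuracy estimate: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    m = 500  # ASSUMPTION: 5 splits x 100 items pooled into one paired set
    for rho_d in (0.10, 0.05):
        mde_pp = 100 * paired_mde(m, rho_d)
        print(f"rho_d={rho_d:.2f}: MDE = {mde_pp:.2f} pp")
    print(f"pairs needed for 3.2 pp at rho_d=0.10: {required_pairs(0.032, 0.10)}")
    print(f"binomial SD at p=0.5, n=100: {100 * binomial_sd(0.5, 100):.1f} pp")
```

Under these assumed settings the bound comes out near 4.0 pp at $\rho_d=0.10$ and near 2.8 pp at $\rho_d=0.05$, so a 3.2 pp delta sits between the two, which is consistent with the borderline cell described in the abstract. Because the MDE scales with $\sqrt{\rho_d}$, the planning value chosen for the disagreement rate matters as much as the item count.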