{"slug": "pre-registering-the-detectable-effect-a-paired-mde-budget-for-4-bit-quantization", "title": "Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit", "summary": "A paired minimum detectable effect (MDE) budget for 4-bit quantization benchmarks, adapted from the classical paired-binary sample-size calculation, gives benchmark designers a one-line formula for pre-registering reliability claims before running experiments. In a pilot audit of four models and four benchmarks, every observed difference between FP16 and NF4 models fell below the MDE implied by an assumed disagreement rate of 0.10, and most cross-split standard deviations were within 1.5 percentage points of the binomial reference, indicating that much of the \"benchmark unreliability\" reported on 100-item subsamples is binomial sampling noise. A parallel study found that prompt-template variation on MMLU (2-10 percentage points) meets or exceeds the largest observed quantization delta (3.2 percentage points), so a quantization audit must fix the prompt template first or it absorbs template variance into its noise floor.", "body_md": "arXiv:2605.28873v1 Announce Type: new\nAbstract: This is a planning-method note with an unpaired pilot audit. We adapt the classical paired-binary sample-size calculation (Miettinen, 1968) to quantization benchmarks, giving a conservative minimum detectable effect (MDE) bound $\\delta^{*} \\le (z_{1-\\alpha/2}+z_{1-\\beta})\\sqrt{\\rho_d/m}$ in the paired item count $m$ and the FP16-NF4 disagreement rate $\\rho_d$. The bound turns \"how reliable is my quantization claim?\" into a one-line budget a benchmark designer can commit to before running. We illustrate the bound on four models and four benchmarks ($k=5$ splits of $n=100$), and add a parallel MMLU prompt-template study to put the bound's quantization-noise scale alongside the prompt-noise scale. 
Assuming $\\rho_d=0.10$ (an unmeasured planning value), all observed NF4-FP16 deltas fall below the implied MDE, and most cross-split SDs lie within $\\pm 1.5$ pp of the binomial reference $\\sqrt{p(1-p)/n}$, so much of the variance reported as \"benchmark unreliability\" on $n=100$ subsamples is binomial sampling noise. The single borderline cell (OPT-WinoGrande, $|\\Delta|=3.2$ pp) is below the implied MDE at $\\rho_d=0.10$ but above it at $\\rho_d=0.05$, illustrating the planning trade-off the bound makes explicit. On MMLU, prompt-template ranges of 2-10 pp meet or exceed the largest observed quantization delta (3.2 pp), so a quantization audit that does not first fix the prompt template absorbs template variance into its noise floor. We complement the bound with a five-line pre-registration template.", "url": "https://wpnews.pro/news/pre-registering-the-detectable-effect-a-paired-mde-budget-for-4-bit-quantization", "canonical_source": "https://arxiv.org/abs/2605.28873", "published_at": "2026-05-29 04:00:00+00:00", "updated_at": "2026-05-29 04:18:05.825406+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "ai-research", "natural-language-processing"], "entities": ["OPT", "WinoGrande", "MMLU", "FP16", "NF4", "Miettinen"], "alternates": {"html": "https://wpnews.pro/news/pre-registering-the-detectable-effect-a-paired-mde-budget-for-4-bit-quantization", "markdown": "https://wpnews.pro/news/pre-registering-the-detectable-effect-a-paired-mde-budget-for-4-bit-quantization.md", "text": "https://wpnews.pro/news/pre-registering-the-detectable-effect-a-paired-mde-budget-for-4-bit-quantization.txt", "jsonld": "https://wpnews.pro/news/pre-registering-the-detectable-effect-a-paired-mde-budget-for-4-bit-quantization.jsonld"}}