Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

wpnews.pro



A new paired minimum detectable effect (MDE) budget for 4-bit quantization benchmarks, adapted from the classical paired-binary sample-size calculation (Miettinen, 1968), gives benchmark designers a one-line formula for pre-registering reliability claims before running experiments. In a pilot audit of four models and four benchmarks (five splits of 100 items each), all observed differences between FP16 and NF4-quantized models fell below the MDE implied by an assumed, unmeasured disagreement rate of 0.10, and most cross-split standard deviations lay within 1.5 percentage points of the binomial sampling reference, indicating that much of the reported "benchmark unreliability" on 100-item subsamples is binomial sampling noise. A parallel study also found that prompt-template variation on MMLU (2-10 percentage points) can equal or exceed the largest observed quantization delta (3.2 percentage points), so a quantization audit must fix the prompt template first to avoid absorbing template variance into its noise floor.
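To make the binomial-noise comparison concrete, here is a minimal Python sketch of the reference standard deviation sqrt(p(1-p)/n) for an accuracy measured on an n-item subsample. The function name and the example accuracies are illustrative assumptions, not values taken from the paper.

```python
import math


def binomial_sd(p: float, n: int) -> float:
    """SD of an accuracy estimate over n independent items with true accuracy p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical accuracies, chosen only for illustration.
for p in (0.5, 0.7, 0.9):
    sd_pp = 100 * binomial_sd(p, 100)  # convert to percentage points
    print(f"p={p:.1f}, n=100: binomial SD = {sd_pp:.1f} pp")
```

At n = 100 the reference peaks at 5 pp (reached at p = 0.5), so split-to-split swings of a few points are expected from sampling alone.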

Published May 29, 2026

arXiv:2605.28873v1
Abstract: This is a planning-method note with an unpaired pilot audit. We adapt the classical paired-binary sample-size calculation (Miettinen, 1968) to quantization benchmarks, giving a conservative minimum detectable effect (MDE) bound $\delta^{*} \le (z_{1-\alpha/2}+z_{1-\beta})\sqrt{\rho_d/m}$ in the paired item count $m$ and the FP16-NF4 disagreement rate $\rho_d$. The bound turns "how reliable is my quantization claim?" into a one-line budget a benchmark designer can commit to before running. We illustrate the bound on four models and four benchmarks ($k=5$ splits of $n=100$), and add a parallel MMLU prompt-template study to put the bound's quantization-noise scale alongside the prompt-noise scale. Assuming $\rho_d=0.10$ (an unmeasured planning value), all observed NF4-FP16 deltas fall below the implied MDE, and most cross-split SDs lie within $\pm 1.5$ pp of the binomial reference $\sqrt{p(1-p)/n}$, so much of the variance reported as "benchmark unreliability" on $n=100$ subsamples is binomial sampling noise. The single borderline cell (OPT-WinoGrande, $|\Delta|=3.2$ pp) is below the implied MDE at $\rho_d=0.10$ but above it at $\rho_d=0.05$, illustrating the planning trade-off the bound makes explicit. On MMLU, prompt-template ranges of 2-10 pp meet or exceed the largest observed quantization delta (3.2 pp), so a quantization audit that does not first fix the prompt template absorbs template variance into its noise floor. We complement the bound with a five-line pre-registration template.
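The bound in the abstract can be evaluated in a few lines. The sketch below is an illustration under stated assumptions, not the authors' code: the abstract does not give the significance level, the power, or the paired item count, so alpha = 0.05, power = 0.80, and m = 500 (five splits of 100 items pooled) are assumed here, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def z_sum(alpha: float, power: float) -> float:
    """z_{1-alpha/2} + z_{1-beta}, where power = 1 - beta."""
    std_normal = NormalDist()
    return std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)


def paired_mde(rho_d: float, m: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Conservative bound (z_{1-alpha/2} + z_{1-beta}) * sqrt(rho_d / m)."""
    return z_sum(alpha, power) * math.sqrt(rho_d / m)


def required_pairs(delta: float, rho_d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Smallest m for which the bound is no larger than delta (the bound solved for m)."""
    return math.ceil(z_sum(alpha, power) ** 2 * rho_d / delta ** 2)


# Assumed planning values: m = 500 pooled items, alpha = 0.05, power = 0.80.
for rho_d in (0.10, 0.05):
    print(f"rho_d={rho_d:.2f}: MDE bound = {100 * paired_mde(rho_d, 500):.2f} pp")

# Paired items needed to bring the bound down to 3.2 pp at rho_d = 0.10.
print(required_pairs(0.032, 0.10))
```

Under these assumed settings the bound is roughly 3.96 pp at $\rho_d=0.10$ and 2.80 pp at $\rho_d=0.05$, which places a 3.2 pp delta below the first and above the second, consistent with the borderline cell described in the abstract. Other choices of alpha, power, or m would move both numbers.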

source & further reading

Read original on arxiv.org → arxiv.org/abs/2605.28873
