Pruning Deep Neural Networks via the Marchenko--Pastur Distribution

Researchers propose a Marchenko-Pastur (MP) random-matrix approach to pruning deep neural networks that needs only a very small post-pruning fine-tuning budget. The theory gives deterministic data-path certificates for accuracy retention: if a removed component has a small propagated effect on the logits, pruning preserves the classification of samples whose dense margin exceeds twice the perturbation. On ImageNet-1k, after only three distillation epochs, ViT-B/16 with 2:4 sparsity and ToMe reaches 83.41% top-1 accuracy (1.70 pp below dense) at 59.81% sparse-execution MAC reduction, with a best-observed 1.388x backend speedup on A40 for the same checkpoint; a separate endpoint without ToMe reaches 2.705x on A100.
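
The margin condition can be illustrated with a minimal sketch. It assumes only that pruning moves every logit by at most `eps` in absolute value; the function names and the example numbers are hypothetical and do not come from the paper. Under that assumption, the gap between the top logit and the runner-up shrinks by at most `2 * eps`, so a dense margin above `2 * eps` keeps the predicted class fixed.

```python
def dense_margin(logits):
    """Gap between the largest and second-largest logit of the dense model."""
    top, runner_up = sorted(logits, reverse=True)[:2]
    return top - runner_up


def is_certified(dense_logits, eps):
    """True if a per-logit perturbation of at most eps cannot change the argmax.

    eps is an assumed bound on the sup-norm logit change caused by pruning.
    """
    return dense_margin(dense_logits) > 2.0 * eps


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical dense logits for one sample.
dense = [2.0, 0.5, -1.0]

# Worst case for eps = 0.7: top logit drops by eps, runner-up rises by eps.
worst_case = [dense[0] - 0.7, dense[1] + 0.7, dense[2]]
```

Here `is_certified(dense, 0.7)` holds because the margin 1.5 exceeds 1.4, and the worst-case perturbed logits still select class 0. With `eps = 0.8` the test fails, which means only that this certificate is silent, not that the prediction must change.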

arXiv:2606.02608v1 Announce Type: new Abstract: We study a Marchenko--Pastur (MP) random-matrix approach to pruning deep neural networks with very small post-pruning fine-tuning budgets. The main practical contribution is accuracy retention under short calibration and fine-tuning schedules, rather than a long post-pruning reoptimization pipeline. The theory gives deterministic data-path certificates: if the removed component $R$ has small propagated logit effect $L_s \| R \psi_1(s) \|_\infty$, pruning decreases an elastic-net objective and preserves samples whose dense margin exceeds twice the perturbation. The zero-budget case gives perfect pruning; a prune--restore extension models weight restoration inside a fixed sparse-execution pattern; and an additive $L_2$-regularized model shows admissible random-like components vanish at the training limit, with persistent spikes stabilizing as the MP bulk collapses. Under iid-Gaussian sufficient conditions, the fitted MP edge $\sigma_+$ gives a high-probability layerwise budget signal. On ImageNet-1k, after only three distillation epochs, ViT-B/16 $2{:}4{+}$ToMe reaches $83.41\%$ top-1 ($-1.70$ pp from dense) at $59.81\%$ sparse-execution MAC reduction, with $1.388\times$ best-observed A40 native-$2{:}4$ backend speedup for the same checkpoint and ToMe graph; a separate no-ToMe A100 endpoint gives $2.705\times$. At structured sparsity, ViT-B/16 $6{:}12$ reaches $83.74\%$, ViT-L/16 $8{:}16$ dense+permutation reaches $85.33\%$ ($-0.51$ pp), and ConvNeXtV2-Base $12{:}16$ reaches $86.35\%$ ($-0.37$ pp). For CNNs, ResNet50 $8{:}16$ dense+permutation reaches $75.87\%$ ($-0.26$ pp), and ResNet152d CAST-conv+permutation reaches $81.33\%$ ($-1.53$ pp) at ${\sim}50\%$ MAC accounting with a $1.62\times$ A40 im2col$+2{:}4$ sparse-GEMM audit.
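
As a rough illustration of how an MP edge can separate random-like bulk from spikes, the sketch below uses the textbook result that the largest singular value of an `n_rows` x `n_cols` matrix with iid entries of standard deviation `sigma` concentrates near `sigma * (sqrt(n_rows) + sqrt(n_cols))`. This is not the paper's fitting procedure, which the abstract does not describe: `sigma` is assumed to be known, the singular values are assumed to be precomputed, and all names and numbers are hypothetical.

```python
import math


def mp_singular_edge(n_rows, n_cols, sigma):
    """Asymptotic upper edge of the singular values of an iid noise matrix.

    sigma is the (assumed known) standard deviation of the matrix entries.
    """
    return sigma * (math.sqrt(n_rows) + math.sqrt(n_cols))


def split_spectrum(singular_values, edge):
    """Separate singular values into spikes (above the edge) and bulk."""
    spikes = [s for s in singular_values if s > edge]
    bulk = [s for s in singular_values if s <= edge]
    return spikes, bulk


# Hypothetical layer: 400 x 100 weights with entry noise level 0.1.
edge = mp_singular_edge(400, 100, 0.1)

# Hypothetical singular values of that layer, largest first.
spikes, bulk = split_spectrum([7.2, 4.1, 2.9, 2.5, 1.0], edge)
```

In this toy setting the edge is about 3.0, so the two largest singular values count as spikes and the remaining three fall in the bulk. The share of the spectrum below the edge is one simple reading of a layerwise budget signal; the abstract leaves open how the paper turns the fitted edge into an actual pruning budget.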