The Weight Norm Sets the Grokking Timescale: A Causal Delay Law

A new arXiv preprint reports causal evidence that the weight norm sets the grokking timescale in neural networks, addressing a dispute over whether the weight norm causes delayed generalization. By intervening on the norm during training, the authors found that under free training with weight decay, grokking occurs when the weight norm reaches a critical value Wc, and that clamping the norm to a fixed multiple ρ of Wc produces an exponential delay law T_grok ∝ exp(αρ) with α ≈ 7.5 across four modular arithmetic tasks. Holding the norm above Wc slows grokking rather than preventing it, and a final LayerNorm removes this dependence.
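To make the intervention concrete, here is a minimal sketch of what pinning the weight norm at a multiple of Wc could look like. The summary does not say how the clamp is implemented, so this assumes a global L2 norm over all parameters and a rescale applied after each optimizer step. The function names and the numeric values of W_c and rho are hypothetical and chosen only for illustration.

```python
import numpy as np

def global_norm(params):
    """L2 norm over all parameter arrays taken together."""
    return float(np.sqrt(sum(np.sum(p ** 2) for p in params)))

def clamp_norm(params, target_norm):
    """Rescale every array so the global L2 norm equals target_norm.

    Assumption: the clamp is a uniform rescale of all weights. The source
    does not specify the actual mechanism.
    """
    norm = global_norm(params)
    if norm == 0.0:
        return params  # an all-zero parameter set cannot be rescaled
    scale = target_norm / norm
    return [p * scale for p in params]

# Hypothetical values, not taken from the paper.
rng = np.random.default_rng(0)
params = [rng.normal(size=(8, 4)), rng.normal(size=(4,))]
W_c = 3.0   # assumed critical norm
rho = 1.2   # assumed multiple of W_c at which the norm is held

# In a training loop this call would follow each optimizer step.
params = clamp_norm(params, rho * W_c)
```

A uniform rescale changes only the overall weight scale and leaves the direction of the weight vector untouched, which is why it serves here as the simplest stand-in for holding the norm fixed.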

arXiv:2606.13753v1. Abstract: Grokking is the delayed onset of generalization in neural networks, arising long after they fit the training data. Whether the weight norm causes this delay is disputed: some studies report a critical norm at the transition, others observe grokking with no fixed norm at all. We settle this by intervening on the norm during training rather than only observing it. Under free training with weight decay, networks grok when the weight norm reaches a value Wc that varies little across seeds and learning rates (CV 1 to 2 percent) and grows with the modular base as a power law. When we instead clamp the norm to a fixed multiple ρ of Wc and hold it there, the network still groks, but the delay follows T_grok ∝ exp(αρ). One exponent, α near 7.5, fits this delay across four moduli (R² = 0.996). Over the swept ranges the held norm moves the delay by about 19x and the learning rate by only about 2x, and holding the norm above Wc slows grokking rather than preventing it. A final LayerNorm removes the dependence by decoupling weight scale from the network function; without it the exponential law returns. This pinned-norm delay is the exponential counterpart to the logarithmic delay predicted for a freely contracting norm.
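The delay law lends itself to a quick back-of-the-envelope check. The sketch below takes T_grok ∝ exp(αρ) at face value with α = 7.5 and asks how wide a range of ρ would be needed to change the delay by a given factor. The abstract does not state the swept range of ρ, so the width computed here is only what the law implies if it holds exactly; the function names are hypothetical.

```python
import math

ALPHA = 7.5  # exponent reported in the abstract (approximate)

def delay_ratio(rho_a, rho_b, alpha=ALPHA):
    """Ratio T_grok(rho_b) / T_grok(rho_a) under T_grok ∝ exp(alpha * rho)."""
    return math.exp(alpha * (rho_b - rho_a))

def rho_span_for_ratio(ratio, alpha=ALPHA):
    """Width of the rho range that changes the delay by the given factor."""
    return math.log(ratio) / alpha

# Width of rho implied by the reported ~19x change in delay.
span_19x = rho_span_for_ratio(19.0)
print(f"implied rho span for 19x: {span_19x:.3f}")
```

Under these assumptions a 19x change in delay corresponds to a ρ span of about 0.39, so a shift of a few tenths in the held norm, measured in units of Wc, already outweighs the roughly 2x effect attributed to the learning rate.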