[ARTICLE · art-27551] src=arxiv.org ↗ pub= topic=neural-networks verified=true sentiment=· neutral

The Weight Norm Sets the Grokking Timescale: A Causal Delay Law

A preprint posted to arXiv reports causal evidence that the weight norm sets the grokking timescale in neural networks, addressing a dispute over whether the weight norm causes delayed generalization. By intervening on the norm during training rather than only observing it, the authors find that under free training with weight decay, networks grok when the weight norm reaches a critical value Wc. Clamping the norm to a fixed multiple ρ of Wc yields an exponential delay law, T_grok ∝ exp(αρ), with a single exponent α ≈ 7.5 fitting four moduli in modular arithmetic. Holding the norm above Wc slows grokking rather than preventing it, and a final LayerNorm removes the dependence by decoupling weight scale from the network function.
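
To make the delay law concrete, here is a minimal Python sketch of the arithmetic implied by T_grok ∝ exp(αρ). It takes α = 7.5 from the abstract as an approximate value; the function names are hypothetical, and the swept range of ρ is not stated in the source, so the ρ span computed below is only what the law would imply if it held exactly.

```python
import math

ALPHA = 7.5  # approximate exponent reported in the abstract


def delay_ratio(rho_a: float, rho_b: float, alpha: float = ALPHA) -> float:
    """Return T_grok(rho_b) / T_grok(rho_a) under T_grok ∝ exp(alpha * rho).

    The unknown proportionality constant cancels in the ratio.
    """
    return math.exp(alpha * (rho_b - rho_a))


def rho_span_for_ratio(ratio: float, alpha: float = ALPHA) -> float:
    """Return the change in rho that multiplies the delay by `ratio`."""
    return math.log(ratio) / alpha


# Change in rho that would account for a 19x change in delay,
# assuming the exponential law holds exactly (an idealization).
span = rho_span_for_ratio(19.0)
print(f"rho span for a 19x delay: {span:.2f}")
```

Under these assumptions, a 19x change in delay corresponds to a change in ρ of roughly 0.39, which illustrates how strongly the delay depends on the held norm.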

read 1 min · published Jun 15, 2026

arXiv:2606.13753v1. Abstract: Grokking is the delayed onset of generalization in neural networks, arising long after they fit the training data. Whether the weight norm causes this delay is disputed: some studies report a critical norm at the transition, others observe grokking with no fixed norm at all. We settle this by intervening on the norm during training rather than only observing it. Under free training with weight decay, networks grok when the weight norm reaches a value Wc that varies little across seeds and learning rates (CV 1 to 2 percent) and grows with the modular base as a power law. When we instead clamp the norm to a fixed multiple rho of Wc and hold it there, the network still groks, but the delay follows T_grok proportional to exp(alpha rho). One exponent, alpha near 7.5, fits this delay across four moduli (R^2 = 0.996). Over the swept ranges the held norm moves the delay by about 19x and the learning rate by only about 2x, and holding the norm above Wc slows grokking rather than preventing it. A final LayerNorm removes the dependence by decoupling weight scale from the network function; without it the exponential law returns. This pinned-norm delay is the exponential counterpart to the logarithmic delay predicted for a freely contracting norm.
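
The abstract does not say how the norm is clamped, so the following is only a sketch of one plausible intervention: after each optimizer step, rescale all parameters so that their global L2 norm equals ρ · Wc. The function names, the choice of a global (rather than per-layer) norm, and the numeric values of `w_c` and `rho` are illustrative assumptions, not the paper's method or settings.

```python
import math


def global_norm(params):
    """L2 norm over all parameters, treated as one flat vector."""
    return math.sqrt(sum(w * w for tensor in params for w in tensor))


def clamp_norm(params, target_norm):
    """Rescale every parameter so the global L2 norm equals target_norm.

    Assumption: a single uniform rescale is applied; the source does not
    specify how the clamp is implemented.
    """
    norm = global_norm(params)
    if norm == 0.0:
        # An all-zero parameter vector has no direction to rescale.
        return [list(tensor) for tensor in params]
    scale = target_norm / norm
    return [[w * scale for w in tensor] for tensor in params]


# Toy parameters (flattened tensors) with global norm 5.0.
params = [[3.0, 0.0], [0.0, 4.0]]

w_c = 2.0  # hypothetical critical norm, for illustration only
rho = 1.2  # hypothetical multiple of w_c at which the norm is held

# In a training loop this would run after each optimizer step.
params = clamp_norm(params, rho * w_c)
print(global_norm(params))
```

Rescaling changes only the magnitude of the weight vector, not its direction, so the optimizer still moves the weights while the norm stays pinned. In a network ending in LayerNorm, such a rescale of the preceding weights would leave the output largely unchanged, which is consistent with the abstract's remark that LayerNorm decouples weight scale from the network function.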
