It’s always the learning rates

Pre-training any kind of good LLM is very, very expensive. Thankfully, we have scaling laws. Lilian Weng of Thinky writes:

Scaling laws are one of the most critical empirical findings in deep learning. The observation is simple in form: the training loss decreases predictably as we scale up model size N, dataset size D, and compute C, following a power-law curve, which appears as a straight line on a log-log plot. We can view scaling laws as a framework for describing the relationship between compute, loss, model size and data; at its core, it is about how to allocate precious compute optimally between N and D.

This predictability makes scaling laws highly valuable in practice. A common workflow is to fit scaling laws on a handful of small runs and then extrapolate to estimate the token and compute requirements for larger models.
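As a rough illustration of that fit-then-extrapolate workflow, here is a minimal sketch. It assumes a pure power law, loss = a * C^(-b), with no irreducible-loss term, which is simpler than what real fits usually use. The function names and all the numbers are made up for the example.

```python
import math

def fit_power_law(xs, losses):
    """Least-squares fit of loss = a * x**(-b) in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    # A power law is a straight line on a log-log plot, so fit a line.
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
             / sum((u - mean_x) ** 2 for u in log_x))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict_loss(a, b, x):
    return a * x ** (-b)

# Synthetic "small runs": compute budgets in FLOPs with made-up final losses.
small_compute = [1e17, 1e18, 1e19, 1e20]
small_losses = [predict_loss(20.0, 0.05, c) for c in small_compute]

a, b = fit_power_law(small_compute, small_losses)
# Extrapolate two orders of magnitude past the largest run we "trained".
extrapolated = predict_loss(a, b, 1e22)
print(f"a={a:.3f}, b={b:.4f}, predicted loss at 1e22 FLOPs: {extrapolated:.3f}")
```

The synthetic data is noise-free, so the fit recovers the exponent exactly. With real runs, small errors in the fitted slope get magnified by the extrapolation distance.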

Being able to do that reliably was an important discovery, because the general level of understanding of deep models was “huh, guess that worked”. Which got expensive, quickly, if it didn’t.

One important set of experiments is to choose the hyper-parameters, particularly the learning rate, which can have an enormous impact on the model. If you don’t choose the right learning rate you might completely misclassify the value of an architectural change. The general approach is to try a bunch of learning rates while holding D and C constant, plot loss against the LR, fit a curve and select the lowest point. [1]

But still, large models are… larger. The learning rate influences how much the model updates based on the loss from each batch. If you blindly apply the same learning rate from a small model to a big one you will (generally!) get worse results. If each update changes more parameters, you need to reduce the learning rate to keep each update step at a similar scale. And if you are training with more data, you also need to update less for each batch in order to keep the learning smooth across the run. Different modules in the model will scale up differently, which can, for example, lead to logits exploding in attention blocks.
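The sweep-and-fit step (try a few learning rates, fit a curve, take the lowest point) might look something like the sketch below. It assumes the curve is a parabola in log10(LR) through the best sweep point and its two neighbours, which is just one simple choice of curve. The helper names and the toy loss function are hypothetical.

```python
import math

def best_lr_from_sweep(lrs, losses):
    """Estimate the loss-minimising LR from a sweep by fitting a parabola in
    log10(LR) through the best point and its two neighbours."""
    pairs = sorted(zip(lrs, losses))
    xs = [math.log10(lr) for lr, _ in pairs]
    ys = [loss for _, loss in pairs]
    i = min(range(len(ys)), key=ys.__getitem__)
    if i == 0 or i == len(ys) - 1:
        # The minimum sits on the edge: the sweep did not bracket the optimum.
        return 10 ** xs[i]
    x0, x1, x2 = xs[i - 1:i + 2]
    y0, y1, y2 = ys[i - 1:i + 2]
    # Vertex of the parabola through the three points.
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    if den == 0:
        return 10 ** x1  # perfectly flat: keep the sampled point
    return 10 ** (x1 - 0.5 * num / den)

def toy_loss(lr):
    # Made-up bowl in log10(LR) with its minimum at 2e-3.
    return 2.5 + 0.4 * (math.log10(lr) - math.log10(2e-3)) ** 2

sweep_lrs = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1]
sweep_losses = [toy_loss(lr) for lr in sweep_lrs]
best_lr = best_lr_from_sweep(sweep_lrs, sweep_losses)
print(f"estimated best LR: {best_lr:.2e}")
```

Fitting in log space matters here: learning rates are swept geometrically, so the bowl is roughly symmetric in log10(LR), not in LR itself.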

In 2022, Yang, Hu et al. proposed Maximal Update Parametrization (μP) and Hyperparameter Transfer (μTransfer) [2], a recipe for taking a learning rate from a small model and transferring it to a bigger model:

For any fixed family of models with varying width and depth (such as the BERT family or the GPT-3 family), we only need to tune a single small model and can reuse its HPs for all models in the family. For example, we will use this technique to tune BERT-base (110M parameters) and BERT-large (350M parameters) simultaneously by transferring from a 13M model.

Lilian’s post isn’t really just about scaling laws though, it’s about how to screw up when using scaling laws:

Despite its clean form, in practice, scaling law fitting can be surprisingly sensitive to seemingly trivial procedural choices, like how you count parameters, how you round the precision, how you sum or average the loss, etc.

Because a scaling law is only fit on the (relatively small, relatively cheap) models that we can afford to train, and the prediction is extrapolated for a model orders of magnitude larger. In such a setup, choices that look like rounding error may lead to wild differences in prediction.

Scaling laws only hold when you keep a lot of things constant, and it can be very easy to either tweak something that breaks your assumptions or place too much confidence in a noisy sample that you are going to extrapolate from.

As one example, earlier this year, Zhou, Xing et al. at the Shanghai AI Lab published a paper, How to set the learning rate for large-scale pre-training?, where they attempted to derive useful guidance for something close to the modern LLM recipe: MoEs trained under WSD rather than cosine annealing. [3]

They spend a bunch of time implementing a solid μTransfer route, then conclude… they shouldn’t. Just fit the LR directly! To make this easier, they cut down the search space:

Use just a handful (7 in their test) of learning rates for each scale.
Train on a smaller proportion of the data (in their case about 25%).
Keep the width-to-depth ratio fixed.

They then plot the results across their different scales, fit a surface, and pick the appropriate LR for their target scale.
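To make the "fit a surface, pick the LR" step concrete, here is a hedged sketch. It assumes the best LR at each scale has already been read off a small sweep, and that log10(LR) is a plane in log10(N) and log10(D). That functional form is an assumption for illustration and may not be the one the paper uses. Every name and number below is hypothetical.

```python
import math

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def fit_log_plane(sizes, tokens, best_lrs):
    """Least-squares fit of log10(lr) = c0 + c1*log10(N) + c2*log10(D)."""
    rows = [(1.0, math.log10(n), math.log10(d)) for n, d in zip(sizes, tokens)]
    targets = [math.log10(lr) for lr in best_lrs]
    # Normal equations (A^T A) c = A^T y, solved with Cramer's rule (3x3 only).
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    denom = det3(ata)
    coeffs = []
    for k in range(3):
        replaced = [row[:] for row in ata]
        for i in range(3):
            replaced[i][k] = aty[i]
        coeffs.append(det3(replaced) / denom)
    return coeffs

def predict_lr(coeffs, n_params, n_tokens):
    c0, c1, c2 = coeffs
    return 10 ** (c0 + c1 * math.log10(n_params) + c2 * math.log10(n_tokens))

# Synthetic per-scale optima generated from a made-up "true" surface.
TRUE_COEFFS = (1.0, -0.3, -0.1)
grid = [(n, d) for n in (1e8, 3e8, 1e9) for d in (1e9, 1e10, 1e11)]
sizes = [n for n, _ in grid]
tokens = [d for _, d in grid]
best_lrs = [predict_lr(TRUE_COEFFS, n, d) for n, d in grid]

coeffs = fit_log_plane(sizes, tokens, best_lrs)
# Extrapolate to a model 10x larger than anything in the grid.
target_lr = predict_lr(coeffs, 1e10, 1e11)
print(f"fitted coeffs: {coeffs}, LR for 1e10 params: {target_lr:.2e}")
```

The grid needs to vary N and D independently. If every run used the same tokens-per-parameter ratio, the two columns would be collinear and the plane would not be identifiable.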

What this gets you is a single, global learning rate. Which, surprisingly, works even when extrapolated up to 10x. This is a bit of a departure, and it turns out to work because of the (now-standard) adoption of another architectural change: QK-Norm. [4] Since QK-Norm stops the attention logits blowing up, it removes the need for the per-module scaling that Yang et al. originally argued for!
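For intuition on why QK-Norm keeps attention logits in check, here is a small sketch for a single query/key pair. It assumes an RMS normalisation with no learnable gain, whereas real implementations typically add a learned scale and work on batched tensors. The function names are made up.

```python
import math

def rms_norm(v, eps=1e-6):
    """Rescale a vector so its root-mean-square is (about) 1."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def attention_logit(q, k, qk_norm=False):
    """Scaled dot-product logit for one query/key pair."""
    if qk_norm:
        # Normalise q and k before the dot product.
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

q = [0.5, -1.0, 2.0, 0.1]
k = [1.0, 0.3, -0.7, 2.0]
q_big = [1000.0 * x for x in q]  # pretend the query projection has blown up

raw_small = attention_logit(q, k)
raw_big = attention_logit(q_big, k)
normed_small = attention_logit(q, k, qk_norm=True)
normed_big = attention_logit(q_big, k, qk_norm=True)
print(raw_small, raw_big, normed_small, normed_big)
```

Without normalisation the logit grows in proportion to the query's scale. With it, the logit depends only on the directions of q and k, and its magnitude cannot exceed sqrt(d) in this gain-free version.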

One of the consistently surprising things in LLMs is how often you can’t tell how strong a model will be until it’s fully trained. Many of the $1B researchers out there are folks who say things like “it’s always the learning rate”, take a look at your loss plot and then fix your training run by normalizing two matrices.

[1] The theory is that loss against log(LR) is invex: every stationary point is a global minimum, so you can just solve for a stationary point. This is a little bit more general than convex (bowl-shaped): though convex shapes are also invex, invex allows for weird flat spots. Whether it actually is invex as a rule, who knows, but it works well enough: DeepSeek found that there is actually a pretty big valley where all the LRs are kinda fine, so it works in practice. BUT! What you pick might still matter if you extrapolate from it to a much larger model, which is pretty much Lilian Weng’s whole point.

[2] If this is Greek to you, µ is “mu”.

[3] Cosine decay was the general baseline for learning rate annealing: do a warm up to the target learning rate, then decrease it over the training data size. But you need to know the total training data size! Then people started doing massive training runs and wanting to YOLO in data as they went. Warmup Stable Decay (WSD) training keeps the warmup, then just leaves the LR high. You cool it down in a decay phase when you want to use a checkpoint for something. There is a paper that goes into this from Stanford with the subtitle River Valley Loss Perspective, which feels like poetic Chinese, 河谷损失观.

[4] They did a great job with the ablations here, so we can be pretty sure that this is the reason!
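The difference between the two schedules in the footnote on cosine decay and WSD can be sketched in a few lines. This assumes linear warmup for both and a linear decay phase for WSD; the decay shape is a choice made for this sketch, and other shapes are used in practice. The function and parameter names are hypothetical.

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup, then cosine decay that must know total_steps up front."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def wsd_lr(step, peak_lr, warmup_steps, decay_start=None, decay_steps=0, min_lr=0.0):
    """Linear warmup, then hold at peak_lr until a decay phase is requested."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if decay_start is None or step < decay_start:
        # Stable phase: no knowledge of the final run length is needed.
        return peak_lr
    progress = min((step - decay_start) / max(1, decay_steps), 1.0)
    return peak_lr + (min_lr - peak_lr) * progress

peak = 3e-4
# Cosine needs the run length (10_000 steps here) baked in from step 0.
cosine_mid = cosine_lr(5_000, total_steps=10_000, peak_lr=peak, warmup_steps=500)
# WSD stays at peak until we decide to cool down for a checkpoint.
wsd_mid = wsd_lr(5_000, peak_lr=peak, warmup_steps=500)
wsd_cooled = wsd_lr(9_000, peak_lr=peak, warmup_steps=500,
                    decay_start=8_000, decay_steps=1_000)
print(cosine_mid, wsd_mid, wsd_cooled)
```

Because `wsd_lr` only needs `decay_start` once a cooldown is wanted, the same stable-phase run can be extended with more data or branched into several decayed checkpoints.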

source & further reading

ianbarber.blog — original article